LLM Inference Optimization

Techniques like KV Cache, Paged Attention, Flash Attention, Speculative Decoding, and Continuous Batching are what make LLMs fast and scalable in production. Let’s learn all of these techniques one by one.

KV Cache in LLMs

In this blog, we will learn about KV Cache - where K stands for Key and V stands for Value - and why it is used in Large Language Models (LLMs) to speed up text generation.

We will start with how LLMs generate text one token at a time, understand the role of Key, Value, and Query inside the model, see the problem of repeated computation through an example, and then walk through how KV Cache solves this problem by storing and reusing past results.
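To make the repeated-computation problem concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not a real model: `project` is a hypothetical stand-in for the learned Key, Value, and Query projections, and there is only one attention head. It counts how many Key/Value projections are computed with and without a cache.

```python
import math

def project(token_id, offset):
    # Hypothetical stand-in for a learned projection: a deterministic toy vector.
    return [math.sin(token_id + offset + i) for i in range(4)]

def attend(query, keys, values):
    # Scaled dot-product attention for one query over all available positions.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def generate_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        # Keys and Values for ALL earlier tokens are recomputed at every step.
        keys = [project(tok, 1) for tok in tokens[:t]]
        values = [project(tok, 2) for tok in tokens[:t]]
        projections += 2 * t
        outputs.append(attend(project(tokens[t - 1], 0), keys, values))
    return outputs, projections

def generate_with_cache(tokens):
    cached_keys, cached_values = [], []
    outputs, projections = [], 0
    for tok in tokens:
        # Only the new token's Key and Value are computed, then stored for reuse.
        cached_keys.append(project(tok, 1))
        cached_values.append(project(tok, 2))
        projections += 2
        outputs.append(attend(project(tok, 0), cached_keys, cached_values))
    return outputs, projections

tokens = [3, 1, 4, 1, 5]
uncached_outputs, uncached_projections = generate_without_cache(tokens)
cached_outputs, cached_projections = generate_with_cache(tokens)
print(uncached_projections, cached_projections)
```

Both functions produce the same attention outputs. The cached version simply avoids recomputing Keys and Values it has already seen, so its projection count grows linearly with sequence length instead of quadratically.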

Read here: KV Cache in LLMs

Paged Attention in LLMs

In this blog, we will learn about Paged Attention, a technique that solves the memory waste problem of KV Cache, allowing LLMs to serve many more users at the same time.

We will start with a quick recap of KV Cache, understand the memory problem it creates, see how traditional memory allocation wastes space through an example, and then walk through how Paged Attention solves this problem by borrowing an idea from how computers manage memory.
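The idea borrowed from operating systems can be sketched as a small block allocator. This is an illustrative toy, not the implementation used by any serving engine: the class name `PagedKVAllocator`, its methods, and the block size are all assumptions made for this example. Each sequence gets fixed-size blocks on demand instead of one large reservation up front.

```python
class PagedKVAllocator:
    """Toy allocator: hands out fixed-size KV blocks only when a sequence needs them."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # A new block is needed only when every allocated block is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        # Finished sequences return their blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        return sum(len(table) * self.block_size - self.lengths[seq_id]
                   for seq_id, table in self.block_tables.items())

allocator = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(5):
    allocator.append_token("a")
for _ in range(4):
    allocator.append_token("b")

paged_waste = allocator.wasted_slots()
# Traditional allocation, assuming 16 slots are reserved per sequence up front.
preallocated_waste = (16 - 5) + (16 - 4)
print(paged_waste, preallocated_waste)
```

With paging, waste is limited to the unused slots in each sequence's last block. With up-front reservation, every unused slot up to the maximum length is wasted.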

Read here: Paged Attention in LLMs

Decoding Flash Attention in LLMs

In this blog, we will learn about Flash Attention by decoding it piece by piece - understanding why standard attention is slow, what makes Flash Attention fast, how it uses GPU memory cleverly, and why it is used in almost every modern Large Language Model (LLM).

We will cover the following:

A quick recap of standard attention
Why standard attention is slow
How GPU memory actually works (HBM vs SRAM)
The core idea behind Flash Attention
Tiling: breaking the work into small blocks
Online softmax: computing softmax without the full matrix
Recomputation in the backward pass
Flash Attention 2
Flash Attention 3
Advantages and impact of Flash Attention
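The online softmax item in the list above can be sketched in a few lines of plain Python. This is a scalar toy, not a GPU kernel: it handles one query row with scalar values, and the function names and block size are assumptions for illustration. It shows how a running maximum and running sum let us process scores block by block and still get the same result as a softmax over the full row.

```python
import math

def softmax_weighted_sum(scores, values):
    # Reference: softmax over the whole row at once.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def online_softmax_weighted_sum(scores, values, block_size):
    running_max = float("-inf")
    running_sum = 0.0
    running_out = 0.0
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale what we accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        block_weights = [math.exp(s - new_max) for s in block_scores]
        running_sum = running_sum * correction + sum(block_weights)
        running_out = running_out * correction + sum(
            w * v for w, v in zip(block_weights, block_values))
        running_max = new_max
    return running_out / running_sum

scores = [0.5, 2.0, -1.0, 3.0, 0.1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
full_result = softmax_weighted_sum(scores, values)
tiled_result = online_softmax_weighted_sum(scores, values, block_size=2)
print(full_result, tiled_result)
```

The tiled version never holds more than one block of scores at a time, which is what allows the real kernel to keep its working set in fast on-chip SRAM.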

Read here: Decoding Flash Attention in LLMs

Grouped Query Attention

In this blog, we will learn about Grouped-Query Attention (GQA) and how it differs from Multi-Head Attention (MHA).

Today, we will cover the following topics:

Quick Recap: Multi-Head Attention (MHA)
The Problem with Multi-Head Attention
What is Multi-Query Attention (MQA)?
What is Grouped-Query Attention (GQA)?
How Grouped-Query Attention Works
GQA is a Generalization of MHA and MQA
GQA vs MHA vs MQA
Real-World Use Cases
A Note on Terminology
Uptraining: Converting MHA to GQA
Quick Summary
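The claim that GQA generalizes both MHA and MQA is easy to see from the head mapping alone. Here is a small sketch; the function name is an assumption for illustration, and it only shows which Key/Value head each Query head reads from, not the attention math itself.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    # Query heads are split into equal groups; each group shares one KV head.
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def head_mapping(num_query_heads, num_kv_heads):
    return [kv_head_for_query_head(h, num_query_heads, num_kv_heads)
            for h in range(num_query_heads)]

mha_mapping = head_mapping(8, 8)  # one KV head per query head
gqa_mapping = head_mapping(8, 2)  # four query heads share each KV head
mqa_mapping = head_mapping(8, 1)  # all query heads share a single KV head
print(mha_mapping, gqa_mapping, mqa_mapping)
```

Setting the number of KV heads equal to the number of query heads gives MHA, setting it to 1 gives MQA, and anything in between is GQA. The KV cache shrinks in proportion to the number of KV heads.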

Read here: Grouped Query Attention

Speculative Decoding

In this blog, we will learn about Speculative Decoding: what it is, why LLM generation is slow without it, and how a small draft model and a big target model work together to produce tokens faster. We will also cover the rejection sampling math that guarantees no quality loss, real numbers showing the 2× to 3× speedup, where it is used in production, and the trade-offs to watch out for.

We will cover the following:

What problem does Speculative Decoding solve?
The Big Picture
Why is LLM generation slow?
The core idea behind Speculative Decoding
Step-by-step walkthrough
The verification step
Real numbers and speedup
Where it is used
Trade-offs
Quick Summary
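The verification step can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are made up for this example, the probability distributions are passed in as plain lists rather than coming from real models, and the extra token that is normally sampled from the target model when every draft token is accepted is left out.

```python
import random

def residual_distribution(target, draft):
    # Normalized max(0, p - q): the distribution used to resample after a rejection.
    residual = [max(0.0, p - q) for p, q in zip(target, draft)]
    total = sum(residual)
    return [r / total for r in residual]

def verify_draft_tokens(draft_tokens, draft_dists, target_dists, rng):
    accepted = []
    for token, q, p in zip(draft_tokens, draft_dists, target_dists):
        # Accept the draft token with probability min(1, p(token) / q(token)).
        if rng.random() < min(1.0, p[token] / q[token]):
            accepted.append(token)
            continue
        # On rejection, resample from the residual distribution and stop.
        replacement = rng.choices(range(len(p)),
                                  weights=residual_distribution(p, q))[0]
        accepted.append(replacement)
        break
    return accepted

rng = random.Random(0)

# Draft and target agree exactly: every draft token is accepted.
same = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
all_accepted = verify_draft_tokens([0, 1], same, same, rng)

# Target gives the draft token zero probability: it is rejected and replaced.
corrected = verify_draft_tokens([0], [[1.0, 0.0]], [[0.0, 1.0]], rng)
print(all_accepted, corrected)
```

Accepting with probability min(1, p/q) and resampling from the residual on rejection is what makes the final tokens follow the target model's distribution, which is why the speedup comes without quality loss.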

Read here: Speculative Decoding

Continuous Batching in LLMs

In this blog, we will learn about Continuous Batching, a technique that lets LLM servers handle many more users at the same time by keeping the GPU busy at every single step of generation.

We will cover the following:

Quick Recap: How an LLM Generates Tokens
Why Batching Matters for LLMs
The Old Way: Static Batching
The Problem with Static Batching
What is Continuous Batching?
The Ride-Share Analogy
How Continuous Batching Works Step by Step
A Numeric Example
Real Numbers and Speedup
Benefits of Continuous Batching
A Few Important Notes
Quick Summary
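A tiny simulation shows the difference between static and continuous batching. It is a sketch under simplifying assumptions: each request is represented only by how many tokens it still has to generate, every step produces one token per running request, and prefill cost is ignored. The function names are made up for this example.

```python
def static_batching_steps(output_lengths, batch_size):
    # Each batch runs until its longest request finishes.
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps

def continuous_batching_steps(output_lengths, batch_size):
    waiting = list(output_lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill any free slot before every step, not just between batches.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Each running request emits one token; finished requests leave.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps

output_lengths = [2, 8, 3, 2, 4, 1]
static_steps = static_batching_steps(output_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=2)
print(static_steps, continuous_steps)
```

In the static case, short requests sit idle in their slot until the longest request in the batch finishes. In the continuous case, a freed slot is handed to the next waiting request right away, so the same work completes in fewer steps.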

Read here: Continuous Batching in LLMs

Prompt Caching

In this blog, we will learn how Prompt Caching works. We will see why we need it, what happens inside a large language model when a prompt is cached, and where it is used in real systems like AI assistants and agents.

We will cover the following:

What is a prompt
A quick recap of how an LLM reads a prompt
What is Prompt Caching
Why we need Prompt Caching
The core idea behind Prompt Caching
The exact-prefix rule
Cache write vs cache read and TTL
What we should put in the cache
The benefits of Prompt Caching
Prompt Caching in the real world
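The exact-prefix rule and the TTL can be illustrated with a toy cache. This is a sketch, not any provider's API: the `PromptCache` class, its methods, and the TTL value are assumptions for this example. Real systems store the computed KV cache for the prefix and often work at block or breakpoint granularity; here we only track which prefixes are cached and when they expire.

```python
class PromptCache:
    """Toy prompt cache: remembers token prefixes and when they expire."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.entries = {}  # tuple of prefix tokens -> expiry time

    def write(self, prefix_tokens, now):
        # Cache write: store the prefix with a fresh expiry time.
        self.entries[tuple(prefix_tokens)] = now + self.ttl_seconds

    def read(self, prompt_tokens, now):
        # Cache read: number of prompt tokens covered by the longest live
        # cached prefix that matches the start of the prompt exactly.
        best = 0
        for prefix, expires_at in self.entries.items():
            is_live = now < expires_at
            is_exact_prefix = tuple(prompt_tokens[:len(prefix)]) == prefix
            if is_live and is_exact_prefix:
                best = max(best, len(prefix))
        return best

cache = PromptCache(ttl_seconds=300)
cache.write(["system", "rules", "tools"], now=0)

hit = cache.read(["system", "rules", "tools", "question"], now=10)
miss_changed_prefix = cache.read(["system", "RULES", "tools", "question"], now=10)
miss_expired = cache.read(["system", "rules", "tools", "question"], now=400)
print(hit, miss_changed_prefix, miss_expired)
```

A single changed token inside the prefix turns a hit into a miss. This is why stable content such as system instructions and tool definitions belongs at the start of the prompt, and the changing user input at the end.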

Read here: How does Prompt Caching work?

Prefill vs Decode: LLM Inference Optimization

In this blog, we will learn about Prefill and Decode, the two phases of LLM inference, and how understanding them helps us optimize the speed of an LLM. We will see how each phase works, how the KV cache connects them, how they differ, which phase matters more for a given use case, and how we optimize each phase to make an LLM faster.

We will cover the following:

The two phases: Prefill and Decode
Prefill explained in simple words
Decode explained in simple words
A diagram of the two phases and the KV cache flow
The KV cache as the bridge between the two phases
A step-by-step walkthrough of a few decode steps
Prefill vs Decode comparison table
Why this split matters: compute-bound vs memory-bound
The key metrics: TTFT, TPOT, throughput, and end-to-end latency
Optimization techniques mapped to each phase
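The key metrics in the list above can be computed from a handful of timestamps. Here is a minimal sketch using the common definitions; the function name and the sample timestamps are made up for illustration. TTFT mostly reflects the prefill phase, while TPOT reflects the decode phase.

```python
def inference_metrics(request_time, token_times):
    # token_times holds the arrival time of each generated token, in seconds.
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    first, last = token_times[0], token_times[-1]
    end_to_end = last - request_time
    return {
        # Time To First Token: dominated by prefill.
        "ttft": first - request_time,
        # Time Per Output Token: average gap between tokens during decode.
        "tpot": (last - first) / (len(token_times) - 1),
        "end_to_end_latency": end_to_end,
        # Per-request throughput in tokens per second.
        "tokens_per_second": len(token_times) / end_to_end,
    }

metrics = inference_metrics(
    request_time=0.0,
    token_times=[0.5, 0.6, 0.7, 0.8, 0.9],
)
print(metrics)
```

A chat application usually cares most about a low TTFT and a steady TPOT, while an offline batch job usually cares most about overall throughput.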

Read here: Prefill vs Decode: LLM Inference Optimization

How does vLLM work?

In this blog, we will learn how vLLM works. We will see why we need it, how it manages GPU memory efficiently, and where it is used in the real world to serve large language models to many users at once.

We will cover the following:

What is serving an LLM
A quick recap of prefill, decode, and the KV cache
The problem: the KV cache eats GPU memory
Why naive serving wastes memory
What is vLLM
PagedAttention, the core idea
How PagedAttention shares memory
Continuous batching
The OpenAI-compatible API server
The benefits of vLLM
vLLM in the real world
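To get a feel for why the KV cache eats GPU memory, here is a back-of-the-envelope calculation. The model shape below is a hypothetical example chosen for round numbers, not the configuration of any specific model, and the function name is an assumption for this sketch.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    # The factor 2 accounts for storing both a Key and a Value vector
    # in every layer for every KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

# Hypothetical model shape, with values stored in 16-bit precision (2 bytes).
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_element=2)

per_sequence = per_token * 4096  # one sequence of 4096 tokens
print(per_token / 1024, "KiB per token")
print(per_sequence / 1024**3, "GiB per 4096-token sequence")
```

At this size, a few dozen long sequences can use more memory than the model weights themselves. This is why reserving the maximum length for every request up front wastes so much memory, and why allocating in small blocks on demand helps.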

Read here: How does vLLM work?

How does SGLang work?

In this blog, we will learn how SGLang works. We will see what problem it solves, how it makes serving large language models faster, and the key ideas behind its design.

We will cover the following:

What is SGLang
RadixAttention: the heart of SGLang
How RadixAttention reuses past work
The frontend language of SGLang
Continuous batching in SGLang
Structured output and faster decoding
More powerful features of SGLang
How SGLang compares to vLLM
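The prefix-reuse idea behind RadixAttention can be sketched with a plain token trie. This is a simplification and not SGLang's implementation: a real radix tree compresses runs of tokens into single edges, stores the KV cache at each node, and evicts old entries, while this toy only tracks which token prefixes have been seen. The class and method names are assumptions for this example.

```python
class PrefixTree:
    """Toy token trie: finds how much of a new request was already processed."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        # Record a processed sequence, sharing nodes with existing prefixes.
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})

    def match_prefix(self, tokens):
        # Number of leading tokens whose work could be reused.
        node = self.root
        matched = 0
        for token in tokens:
            if token not in node:
                break
            node = node[token]
            matched += 1
        return matched

tree = PrefixTree()
tree.insert([1, 2, 3, 4])  # first request
tree.insert([1, 2, 5])     # second request shares the prefix [1, 2]

reuse_a = tree.match_prefix([1, 2, 3, 9])
reuse_b = tree.match_prefix([1, 2, 5, 6])
reuse_none = tree.match_prefix([7, 8])
print(reuse_a, reuse_b, reuse_none)
```

Because the tree holds many branches at once, requests that share only part of their history, such as multi-turn chats or few-shot prompts with a common header, can each reuse the longest prefix they have in common with earlier work.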

Read here: How does SGLang work?

Prepare for your AI Engineering interview: AI Engineering Interview Questions

That's it for now.

Thanks

Amit Shekhar
Founder @ Outcome School
