LLM Inference Visualizations

Interactive step-by-step animations explaining how vLLM and modern LLM inference engines work under the hood.

Memory Management

PagedAttention

How vLLM divides GPU memory into fixed-size physical blocks, maps logical block tables per request, and eliminates KV cache fragmentation — token by token.

Explore
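The block-table mechanism described above can be sketched in a few lines. This is an illustrative toy, not vLLM's actual API: `BlockManager`, `BLOCK_SIZE`, and the method names are all made up for the sketch.

```python
BLOCK_SIZE = 4  # token slots per physical block (vLLM commonly uses 16)

class BlockManager:
    """Toy PagedAttention-style allocator: fixed-size physical blocks,
    one logical block table per request."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # request id -> logical block table
        self.lengths = {}                    # request id -> tokens stored

    def append_token(self, req):
        table = self.tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        if n % BLOCK_SIZE == 0:              # current block full (or no block yet)
            table.append(self.free.pop())    # allocate one new physical block
        self.lengths[req] = n + 1

    def free_request(self, req):
        self.free.extend(self.tables.pop(req))  # blocks return to the pool
        del self.lengths[req]

mgr = BlockManager(num_blocks=8)
for _ in range(6):                           # 6 tokens need ceil(6/4) = 2 blocks
    mgr.append_token("req0")
```

Because blocks are allocated one at a time as tokens arrive, at most one partially filled block exists per request, which is where the fragmentation savings come from.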
Execution Phases

Prefill vs. Decode

Why LLM inference has two fundamentally different phases: a compute-bound prefill that processes all prompt tokens at once, and a memory-bandwidth-bound decode that generates tokens one at a time.

Explore
Scheduling

Continuous Batching

How iteration-level scheduling eliminates the GPU idle time of static batching by slotting new requests in the moment another finishes — instead of waiting for an entire batch to complete.

Explore
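The scheduling loop can be sketched as follows. This is a toy model of iteration-level scheduling, not vLLM's actual `Scheduler`: each step drops finished requests, admits waiting ones into the freed slots, then runs one decode iteration for the whole batch.

```python
from collections import deque

def continuous_batching(requests, capacity):
    """requests: list of (request id, tokens to generate).
    Returns the number of decode iterations until all requests finish."""
    waiting = deque(requests)
    running = {}                       # request id -> tokens remaining
    steps = 0
    while waiting or running:
        # admit new requests the moment slots are free
        while waiting and len(running) < capacity:
            rid, n = waiting.popleft()
            running[rid] = n
        # one decode iteration for everything in the batch
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]       # slot freed this very iteration
        steps += 1
    return steps
```

With capacity 2 and requests needing 3, 1, and 2 tokens, the short request finishes after step 1 and the third request slots in immediately, so everything completes in 3 iterations; a static batcher would hold the freed slot idle until the whole batch drained.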
Memory Pressure

KV Cache Preemption

What happens when the KV cache fills up mid-batch: how vLLM evicts a request's blocks, swaps them to CPU memory, and later restores them to resume generation without corrupting other requests.

Explore
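The swap-out path can be sketched like this. It is an illustrative toy (vLLM can also preempt by recomputation rather than swapping, and the class below is not its real block manager): when the GPU pool runs dry, a victim's blocks are copied to a CPU-side store and its GPU blocks are returned to the pool.

```python
class Pool:
    """Toy GPU block pool with swap-based preemption."""
    def __init__(self, n):
        self.free = list(range(n))
        self.gpu_tables = {}   # request -> list of GPU block ids
        self.cpu_store = {}    # request -> number of blocks saved (stand-in
                               # for the actual block contents)

    def allocate(self, req):
        if not self.free:
            victim = next(iter(self.gpu_tables))  # pick any running request
            self.swap_out(victim)
        self.gpu_tables.setdefault(req, []).append(self.free.pop())

    def swap_out(self, req):
        blocks = self.gpu_tables.pop(req)
        self.cpu_store[req] = len(blocks)   # pretend we copied contents to CPU
        self.free.extend(blocks)            # GPU blocks immediately reusable

    def swap_in(self, req):
        n = self.cpu_store.pop(req)
        self.gpu_tables[req] = [self.free.pop() for _ in range(n)]

pool = Pool(2)
pool.allocate("a")
pool.allocate("a")          # pool now exhausted
pool.allocate("b")          # forces "a" to be swapped out to CPU memory
```

The evicted request keeps its saved state on the CPU side and can be restored with `swap_in` once blocks free up; other requests' tables are never touched.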
Fundamentals

KV Cache

Why transformers cache key and value vectors during autoregressive generation, how the cache grows with sequence length, and what it costs in memory — the foundation behind all vLLM optimizations.

Explore
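The memory cost is easy to quantify with the standard per-token formula: 2 tensors (K and V) times layers times heads times head dimension times bytes per element. The model shape below is a Llama-2-7B-like assumption for illustration.

```python
def kv_bytes_per_token(layers, heads, head_dim, dtype_bytes=2):
    """KV cache bytes stored per generated token (fp16 by default)."""
    return 2 * layers * heads * head_dim * dtype_bytes

# Assumed shape: 32 layers, 32 heads, head_dim 128 (Llama-2-7B-like)
per_token = kv_bytes_per_token(32, 32, 128)   # 524288 bytes = 512 KiB per token
per_2k_seq_gib = per_token * 2048 / 2**30     # a 2048-token sequence in GiB
```

At roughly half a MiB per token, a single 2048-token sequence consumes about 1 GiB of GPU memory, which is why the cache, not the weights, becomes the batching bottleneck.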
Memory Sharing

Parallel Sampling

How N independent completions from the same prompt share a single copy of the prompt's KV cache using reference counting — storing the prompt's blocks once instead of N times, a saving that grows with both prompt length and N.

Explore
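The reference-counting scheme can be sketched in a few lines. `RefCountedBlocks` is an illustrative stand-in, not vLLM's actual data structure: every sample points at the same physical prompt blocks, and a block is only freed when its count reaches zero.

```python
class RefCountedBlocks:
    """Toy reference counting for physical KV blocks shared by N samples."""
    def __init__(self):
        self.refs = {}                     # physical block id -> refcount

    def share(self, blocks, n_samples):
        for b in blocks:
            self.refs[b] = self.refs.get(b, 0) + n_samples

    def release(self, blocks):
        freed = []
        for b in blocks:
            self.refs[b] -= 1
            if self.refs[b] == 0:          # last holder gone: actually free
                del self.refs[b]
                freed.append(b)
        return freed

rc = RefCountedBlocks()
rc.share([0, 1], n_samples=3)   # 3 samples share prompt blocks 0 and 1
first = rc.release([0, 1])      # sample 1 finishes: nothing freed yet
second = rc.release([0, 1])     # sample 2 finishes: nothing freed yet
last = rc.release([0, 1])       # last sample releases: blocks truly freed
```

In a full implementation, a sample that needs to write into a shared block triggers copy-on-write first; only the read-only prompt blocks stay shared.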
Memory Sharing

Beam Search

How beam search maintains K candidate sequences, shares prompt and common-prefix KV blocks as a tree, and immediately frees memory when low-scoring beams are pruned.

Explore
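The pruning step can be sketched as a top-K selection plus a block liveness check. The function and the scores below are illustrative, not vLLM's beam search code: blocks held only by pruned beams become freeable immediately, while the shared prompt prefix survives as long as any beam uses it.

```python
import heapq

def prune_beams(candidates, k):
    """candidates: list of (score, beam id, set of block ids).
    Returns (kept beam ids, block ids no surviving beam uses)."""
    kept = heapq.nlargest(k, candidates, key=lambda c: c[0])
    kept_blocks = set().union(*(c[2] for c in kept)) if kept else set()
    freeable = set()
    for _, _, blocks in candidates:
        freeable |= blocks - kept_blocks   # blocks only pruned beams held
    return [c[1] for c in kept], freeable

beams = [
    (-1.2, "b0", {0, 1, 2}),   # blocks 0 and 1 are the shared prompt prefix
    (-0.5, "b1", {0, 1, 3}),
    (-3.0, "b2", {0, 1, 4}),
]
kept, freeable = prune_beams(beams, k=2)
```

Here beam b2 is pruned, so only its private block 4 is freed; the prompt blocks 0 and 1 remain alive because the two surviving beams still reference them.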
Computation

PagedAttention — Block-Wise Computation

How the attention dot-product is computed across non-contiguous physical blocks: iterate over logical blocks to gather scores, apply one softmax, then accumulate the weighted value sum — identical math to contiguous attention.

Explore
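The "identical math" claim can be checked directly in pure Python. This is a scalar sketch of the idea, not the real fused kernel: the paged version gathers K and V by walking the block table, but because all logits feed one shared softmax, the result matches contiguous attention exactly.

```python
import math

def attention(q, keys, values):
    """Reference attention for one query over contiguous K/V lists."""
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)                                  # stabilize the softmax
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(dim)]

def paged_attention(q, block_table, kv_blocks):
    """Same computation, with K/V gathered from non-contiguous blocks."""
    keys, values = [], []
    for phys in block_table:            # iterate logical blocks in order
        for k, v in kv_blocks[phys]:    # gather from scattered physical storage
            keys.append(k)
            values.append(v)
    return attention(q, keys, values)   # one softmax over all positions

q = [1.0, 0.0]
tokens = [([1.0, 0.0], [1.0, 2.0]), ([0.0, 1.0], [3.0, 4.0]),
          ([1.0, 1.0], [5.0, 6.0])]
# same tokens, scattered across physical blocks 7 and 2 (block size 2)
kv_blocks = {7: tokens[:2], 2: tokens[2:]}
contiguous = attention(q, [k for k, _ in tokens], [v for _, v in tokens])
paged = paged_attention(q, [7, 2], kv_blocks)
```

The real kernel fuses the gather into the dot-product loop instead of materializing contiguous copies, but the arithmetic is the same.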
Distributed Systems

Tensor Parallelism

How vLLM distributes model weights across multiple GPUs, partitions the KV cache by attention head, uses a single block manager on the master node, and synchronizes intermediate activations with all-reduce operations.

Explore
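The all-reduce synchronization can be illustrated with a toy row-parallel linear layer in pure Python (the `all_reduce` function here is a stand-in for the NCCL collective, and the shapes are made up): the hidden dimension is sharded across ranks, each rank computes a partial matmul, and summing the partials reproduces the single-GPU result.

```python
def matmul(x, w):
    """x: input vector; w: rows x cols weight matrix."""
    return [sum(xi * wi[j] for xi, wi in zip(x, w))
            for j in range(len(w[0]))]

def all_reduce(partials):
    """Stand-in for an all-reduce sum across tensor-parallel ranks."""
    return [sum(col) for col in zip(*partials)]

x = [1.0, 2.0, 3.0, 4.0]
w = [[1, 0], [0, 1], [1, 1], [2, 0]]    # 4 x 2 weight matrix

full = matmul(x, w)                      # single-GPU reference result

ranks = 2
shard = len(x) // ranks                  # split input and weight rows by rank
partials = [matmul(x[r * shard:(r + 1) * shard],
                   w[r * shard:(r + 1) * shard])
            for r in range(ranks)]
tp = all_reduce(partials)                # partial sums combine to the full result
```

Attention heads shard the same way: each GPU holds the K/V cache for its own heads, so only the activations, never the cache, cross the interconnect.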
Problem

Memory Fragmentation

What goes wrong without PagedAttention: internal fragmentation wastes up to 44% of GPU memory through over-reservation, while external fragmentation blocks new requests even though enough total free memory exists.

Explore
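The two waste modes compare like this in a back-of-envelope sketch (the sequence lengths and the 2048-token reservation are illustrative numbers): naive serving reserves `max_seq_len` slots per request up front, while paging allocates small blocks on demand and wastes at most one partial block per sequence.

```python
def reserved_waste(actual_len, max_seq_len):
    """Slots reserved up front but never filled (internal fragmentation)."""
    return max_seq_len - actual_len

def paged_waste(actual_len, block_size=16):
    """Unused slack in the sequence's last partially filled block."""
    return -actual_len % block_size

lengths = [100, 700, 350]                     # assumed actual sequence lengths
naive = sum(reserved_waste(n, 2048) for n in lengths)
paged = sum(paged_waste(n) for n in lengths)
```

For these three sequences the naive scheme strands 4994 token slots versus 18 for paging, and because paged blocks are uniform and relocatable, external fragmentation disappears as well.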