Interactive step-by-step animations explaining how vLLM and modern LLM inference engines work under the hood.
How vLLM divides GPU memory into fixed-size physical blocks, maps logical block tables per request, and eliminates KV cache fragmentation — token by token.
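The block-table mapping can be sketched in a few lines of Python. Names like `BlockAllocator` and `Request` are illustrative, not vLLM's actual classes, and a tiny block size is used for brevity (vLLM commonly uses 16 tokens per block):

```python
BLOCK_SIZE = 4  # tokens per physical block (vLLM commonly uses 16)

class BlockAllocator:
    """Pool of fixed-size physical blocks carved out of GPU memory."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()  # any free block works: no contiguity required

class Request:
    """Per-request logical block table: logical index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:  # last block full (or none yet)
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=8)
req = Request(alloc)
for _ in range(6):        # 6 tokens need ceil(6/4) = 2 blocks, allocated lazily
    req.append_token()
print(req.block_table)    # → [7, 6]: the mapping, not the memory, is per-request
```

Because each request carries only a table of indices, physical blocks can live anywhere in the pool, which is what removes the need for one large contiguous reservation per sequence.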
Execution Phases
Why LLM inference has two fundamentally different phases: a compute-bound prefill that processes all prompt tokens at once, and a memory-bandwidth-bound decode that generates tokens one at a time.
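The compute-bound vs. memory-bound split can be made concrete with a back-of-envelope arithmetic-intensity estimate (the hyperparameters below are illustrative, and the cost model is deliberately simplified): prefill reuses each weight across every prompt token, while decode reads the full weight matrix to emit a single token.

```python
def arithmetic_intensity(tokens, d_model, dtype_bytes=2):
    """FLOPs per byte moved for one d_model x d_model weight matmul."""
    flops = 2 * tokens * d_model * d_model                 # multiply-adds
    bytes_moved = dtype_bytes * (d_model * d_model         # weights read once
                                 + 2 * tokens * d_model)   # activations in/out
    return flops / bytes_moved

prefill = arithmetic_intensity(tokens=512, d_model=4096)  # hundreds of FLOPs/byte
decode = arithmetic_intensity(tokens=1, d_model=4096)     # ~1 FLOP/byte
```

With hundreds of FLOPs per byte, prefill keeps the GPU's arithmetic units busy; at roughly one FLOP per byte, decode is limited by how fast weights and KV cache can be streamed from memory.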
Scheduling
How iteration-level scheduling eliminates the GPU idle time of static batching by slotting new requests in the moment another finishes — instead of waiting for an entire batch to complete.
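The scheduling loop can be sketched as follows (a minimal model with illustrative names, not vLLM's scheduler API): the batch is rebuilt at every iteration, so a finished request's slot is refilled on the very next step.

```python
from collections import deque

def continuous_batching(waiting, max_batch, steps_needed):
    """waiting: request ids in arrival order; steps_needed: id -> decode steps."""
    waiting = deque(waiting)
    running, done, timeline = [], [], []
    while waiting or running:
        while waiting and len(running) < max_batch:  # admit at iteration boundary
            running.append(waiting.popleft())
        timeline.append(sorted(running))             # one fused decode step
        for r in list(running):
            steps_needed[r] -= 1
            if steps_needed[r] == 0:                 # slot frees immediately
                running.remove(r)
                done.append(r)
    return done, timeline

done, timeline = continuous_batching(
    waiting=["A", "B", "C"], max_batch=2,
    steps_needed={"A": 1, "B": 3, "C": 2})
# Step 1 runs [A, B]; A finishes, so C joins B on the very next step: [B, C].
```

Under static batching, C would instead wait until both A and B had fully completed, leaving one batch slot idle for the whole time B keeps generating.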
Memory Pressure
What happens when the KV cache fills up mid-batch: how vLLM evicts a request's blocks, swaps them to CPU memory, and later restores them to resume generation without corrupting other requests.
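A toy sketch of preempt-and-swap (illustrative names, not vLLM's API): a victim's blocks move to a CPU-side store keyed by logical index, so on swap-in they may land in different physical slots while the restored content is unchanged.

```python
class KVStore:
    def __init__(self, gpu_blocks):
        self.gpu_free = list(range(gpu_blocks))
        self.gpu = {}      # physical block id -> KV payload
        self.cpu = {}      # (request, logical idx) -> KV payload
        self.tables = {}   # request -> [physical block ids]

    def append_block(self, req, payload):
        phys = self.gpu_free.pop()
        self.gpu[phys] = payload
        self.tables.setdefault(req, []).append(phys)

    def swap_out(self, req):
        for i, phys in enumerate(self.tables.pop(req)):
            self.cpu[(req, i)] = self.gpu.pop(phys)  # keyed by logical index
            self.gpu_free.append(phys)               # reusable by other requests

    def swap_in(self, req):
        i = 0
        while (req, i) in self.cpu:                  # restore in logical order
            self.append_block(req, self.cpu.pop((req, i)))
            i += 1

store = KVStore(gpu_blocks=3)
store.append_block("victim", "kv-0")
store.append_block("victim", "kv-1")
store.swap_out("victim")                 # frees the victim's GPU blocks
store.append_block("other", "kv-other")  # another request takes a freed block
store.swap_in("victim")                  # victim resumes, contents intact
```

Because the block table is rebuilt on swap-in, the other request's blocks are never touched, which is how a preemption avoids corrupting its batch-mates.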
Fundamentals
Why transformers cache key and value vectors during autoregressive generation, how the cache grows with sequence length, and what it costs in memory — the foundation behind all vLLM optimizations.
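The memory cost is easy to estimate. The hyperparameters below are illustrative, roughly matching a 13B-parameter transformer in fp16:

```python
def kv_bytes_per_token(layers, heads, head_dim, dtype_bytes):
    # one key vector and one value vector per head, per layer
    return 2 * layers * heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=40, heads=40, head_dim=128, dtype_bytes=2)
print(per_token)                    # 819200 bytes ≈ 0.8 MB for every token
print(per_token * 2048 / 2**30)    # a 2048-token sequence costs ≈ 1.56 GiB
```

At close to a megabyte per token, a handful of long sequences can consume more GPU memory than is left after loading the weights, which is why how the cache is laid out matters so much.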
Memory Sharing
How N independent completions from the same prompt share a single copy of the prompt's KV cache using reference counting — avoiding memory that would otherwise grow in proportion to prompt length × N.
Memory Sharing
How beam search maintains K candidate sequences, shares prompt and common-prefix KV blocks as a tree, and immediately frees memory when low-scoring beams are pruned.
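Both sharing schemes rest on the same primitive: a per-block reference count, incremented when a sequence forks and decremented when a sequence (or pruned beam) is freed. A minimal sketch, with illustrative names:

```python
class SharedBlocks:
    """Reference counts for physical KV blocks shared across sequences."""
    def __init__(self):
        self.refcount = {}

    def fork(self, parent_table):
        for b in parent_table:                # child shares, never copies
            self.refcount[b] = self.refcount.get(b, 0) + 1
        return list(parent_table)             # copy of the *table*, not the blocks

    def free(self, table):
        freed = []
        for b in table:
            self.refcount[b] -= 1
            if self.refcount[b] == 0:         # last owner gone: reclaim at once
                freed.append(b)
        return freed

pool = SharedBlocks()
prompt_blocks = [0, 1, 2]
tables = [pool.fork(prompt_blocks) for _ in range(4)]  # N = 4 completions
# 3 physical blocks back all 4 sequences instead of 12 copies.
pruned = pool.free(tables[0])  # prune one candidate: still shared, nothing freed
```

Only when the last sequence referencing a block is freed does the block return to the pool, which is what lets beam pruning release memory immediately without risking the surviving beams.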
Computation
How the attention dot-product is computed across non-contiguous physical blocks: iterate over logical blocks to gather scores, apply one softmax, then accumulate the weighted value sum — identical math to contiguous attention.
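A numerical sketch in pure Python (illustrative, far from a real kernel) makes the "identical math" claim checkable: gathering K/V block by block and applying one softmax at the end reproduces contiguous attention exactly.

```python
import math

def paged_attention(query, kv_blocks, block_table):
    scores, values = [], []
    for phys in block_table:              # walk blocks in logical order
        for k, v in kv_blocks[phys]:      # gather from scattered physical blocks
            scores.append(sum(q * ki for q, ki in zip(query, k)))
            values.append(v)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]   # one softmax over all tokens
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(dim)]

# Two tokens live in physical block 5, one in block 2; logical order is 5 -> 2.
kv_blocks = {5: [([1.0, 0.0], [1.0, 2.0]), ([0.0, 1.0], [3.0, 4.0])],
             2: [([1.0, 1.0], [5.0, 6.0])]}
out = paged_attention([1.0, 1.0], kv_blocks, block_table=[5, 2])
contiguous = paged_attention([1.0, 1.0], {0: kv_blocks[5] + kv_blocks[2]}, [0])
# out == contiguous: the block layout does not change the result
```

The physical layout only changes where the gather reads from; the softmax and the weighted sum see the same logical token sequence either way.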
Distributed Systems
How vLLM distributes model weights across multiple GPUs, partitions the KV cache by attention head, uses a single block manager on the master node, and synchronizes intermediate activations with all-reduce operations.
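A toy stand-in for the head-partition-plus-all-reduce step (pure Python lists in place of tensors and NCCL; everything here is illustrative): each worker sums the outputs of only its own heads, and an elementwise all-reduce recovers the full-model result.

```python
def all_reduce(partials):
    """Toy stand-in for an all-reduce: elementwise sum across workers."""
    return [sum(vals) for vals in zip(*partials)]

# Per-head attention outputs; four heads partitioned across two "GPUs".
head_outputs = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0], 3: [7.0, 8.0]}
shards = [[0, 1], [2, 3]]

partials = [[sum(head_outputs[h][d] for h in shard) for d in range(2)]
            for shard in shards]    # each worker sums only its own heads
full = all_reduce(partials)         # equals the single-device sum over all heads
```

Partitioning by head also means each worker only ever needs the KV cache for its own heads, which is why the cache can be sharded the same way as the attention weights.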
Problem
What goes wrong without PagedAttention: internal fragmentation wastes up to 44% of GPU memory through over-reservation, while external fragmentation blocks new requests despite free space existing.
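The over-reservation arithmetic is easy to reproduce. The request lengths below are made up for illustration: a naive allocator reserves `MAX_LEN` slots per request up front, while paged allocation wastes at most one partly filled block per request.

```python
import math

MAX_LEN, BLOCK_SIZE = 2048, 16
actual_lengths = [213, 987, 54, 1402]   # tokens each request actually used

naive = MAX_LEN * len(actual_lengths)   # reserve max_len slots up front
paged = sum(math.ceil(n / BLOCK_SIZE) * BLOCK_SIZE for n in actual_lengths)
used = sum(actual_lengths)

print(f"naive: {1 - used / naive:.0%} wasted")   # 68% wasted
print(f"paged: {1 - used / paged:.0%} wasted")   # 1% wasted
```

The naive scheme also needs each reservation to be contiguous, so even its "free" memory can be unusable for a new request — the external fragmentation half of the problem.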