
KV Cache

Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V

[Diagram: Q (t × dₖ) · Kᵀ (dₖ × t) = S (t × t); softmax(S / √dₖ) = α (t × t); α · V (t × dₖ) = output (t × dₖ). K and V are read from the cache; only the newest token's row (here token 5: embed × Wₖ and embed × Wᵥ) is computed fresh and appended.]
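The cached computation in the diagram above can be sketched in numpy. The weight matrices and d_k = 4 here are illustrative stand-ins, not values from the page:

```python
import numpy as np

# Minimal single-head attention with a KV cache. W_q, W_k, W_v are
# random stand-ins for the learned projection matrices.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

K_cache = np.empty((0, d_k))  # grows one row per decoded token
V_cache = np.empty((0, d_k))

def decode_step(embed):
    """Process one new token: project it, append its k/v rows to the
    cache, then attend over the full cache with a 1 x t query."""
    global K_cache, V_cache
    q = embed @ W_q                              # (1, d_k) -- NEW each step
    K_cache = np.vstack([K_cache, embed @ W_k])  # append NEW row; rest CACHED
    V_cache = np.vstack([V_cache, embed @ W_v])
    s = q @ K_cache.T / np.sqrt(d_k)             # (1, t) scores
    a = np.exp(s - s.max()); a /= a.sum()        # softmax over the row
    return a @ V_cache                           # (1, d_k) output

for t in range(5):
    out = decode_step(rng.standard_normal((1, d_model)))
print(K_cache.shape)  # (5, 4): one cached row per token seen so far
```

Each call touches only one new row of K and V; everything older is reused from the cache, which is exactly the saving the tables below quantify.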
The two cases below compare, per layer: the computation type, the shape of Q, the shape of the score matrix, and the attention FLOPs. (Prefill projects all prompt token embeddings through the learned weight matrices in one pass.)
Without KV Cache — O(n² · d) attention
Q(n×d) · Kᵀ(d×n) = S(n×n)
α(n×n) · V(n×d) = O(n×d)
n = 128 tokens  →  ~2M multiply-adds
n = 1,024       →  ~128M multiply-adds
n = 4,096       →  ~2B multiply-adds
n = 32,768      →  ~130B multiply-adds
The score matrix is n×n, so attention compute grows quadratically with context length. At long context this term dominates total compute, and the full matrix must be materialized in memory (Flash Attention addresses the memory side separately).
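The table entries above can be reproduced with the O(n² · d) count, assuming d = 64 as the head dimension (the value implied by the numbers; the page never states it outright):

```python
# Sanity check of the full-attention multiply-add counts.
d = 64  # assumed head dimension

def full_attn_macs(n):
    # Q·Kᵀ costs n·n·d multiply-adds; α·V costs another n·n·d.
    return 2 * n * n * d

for n in [128, 1024, 4096, 32768]:
    print(f"n = {n:,}: {full_attn_macs(n):,} multiply-adds")
# n = 128 gives 2,097,152 (~2M); n = 32,768 gives 137,438,953,472 (~130B)
```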
With KV Cache — O(n · d) attention per step
q(1×d) · Kᵀ(d×n) = s(1×n)
α(1×n) · V(n×d) = o(1×d)
n = 128 tokens  →  ~31K multiply-adds
n = 1,024       →  ~250K multiply-adds
n = 4,096       →  ~1M multiply-adds
n = 32,768      →  ~8M multiply-adds
The score is a 1×n vector, so reading the cached K and V from memory dominates over the arithmetic. The bottleneck shifts from compute to memory bandwidth, which is why decode is memory-bound rather than compute-bound.
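The memory-bound claim can be made concrete with the arithmetic intensity of one decode step: multiply-adds performed per byte read from the KV cache. The fp16 storage (2 bytes/value) and d = 64 below are illustrative assumptions:

```python
# Arithmetic intensity of a single cached decode step.
d, bytes_per_val = 64, 2  # assumed head dim and fp16 storage

def decode_step_stats(n):
    macs = 2 * n * d                        # q·Kᵀ plus α·V
    bytes_read = 2 * n * d * bytes_per_val  # read the K cache and the V cache
    return macs, bytes_read, macs / bytes_read

for n in [128, 4096, 32768]:
    macs, nbytes, intensity = decode_step_stats(n)
    print(f"n = {n:,}: {macs:,} MACs, {nbytes:,} bytes read, {intensity} MACs/byte")
# Intensity is 0.5 MACs/byte at every n; modern accelerators can sustain
# on the order of 100 MACs per byte of memory bandwidth, so each decode
# step mostly waits on memory, not arithmetic.
```

The intensity is constant in n: doubling the context doubles both the reads and the MACs, so no context length makes cached decode compute-bound.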