
KV Cache

Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V

[Diagram: Q (t × dₖ) · Kᵀ (dₖ × t) = S (t × t); softmax(S / √dₖ) = α (t × t); α · V (t × dₖ) = output (t × dₖ). K and V are read from the cache; only the newest token's row (here token 5: embed × Wₖ and embed × Wᵥ) is computed fresh and appended.]
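The cached computation in the diagram above can be sketched in numpy. The weight matrices and d_k = 4 here are illustrative stand-ins, not values from the page:

```python
import numpy as np

# Minimal single-head attention with a KV cache. W_q, W_k, W_v are
# random stand-ins for the learned projection matrices.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

K_cache = np.empty((0, d_k))  # grows one row per decoded token
V_cache = np.empty((0, d_k))

def decode_step(embed):
    """Process one new token: project it, append its k/v rows to the
    cache, then attend over the full cache with a 1 x t query."""
    global K_cache, V_cache
    q = embed @ W_q                              # (1, d_k) -- NEW each step
    K_cache = np.vstack([K_cache, embed @ W_k])  # append NEW row; rest CACHED
    V_cache = np.vstack([V_cache, embed @ W_v])
    s = q @ K_cache.T / np.sqrt(d_k)             # (1, t) scores
    a = np.exp(s - s.max()); a /= a.sum()        # softmax over the row
    return a @ V_cache                           # (1, d_k) output

for t in range(5):
    out = decode_step(rng.standard_normal((1, d_model)))
print(K_cache.shape)  # (5, 4): one cached row per token seen so far
```

Each call touches only one new row of K and V; everything older is reused from the cache, which is exactly the saving the tables below quantify.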
The two cases below compare, per layer: the computation type, the shape of Q, the shape of the score matrix, and the attention FLOPs. (Prefill projects all prompt token embeddings through the learned weight matrices in one pass.)
Without KV Cache — O(n² · d) attention
Q(n×d) · Kᵀ(d×n) = S(n×n)
α(n×n) · V(n×d) = O(n×d)
n = 128 tokens  →  ~2M multiply-adds
n = 1,024       →  ~128M multiply-adds
n = 4,096       →  ~2B multiply-adds
n = 32,768      →  ~130B multiply-adds
The score matrix is n×n, so attention compute grows quadratically with context length. At long context this term dominates total compute, and the full matrix must be materialized in memory (Flash Attention addresses the memory side separately).
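The table entries above can be reproduced with the O(n² · d) count, assuming d = 64 as the head dimension (the value implied by the numbers; the page never states it outright):

```python
# Sanity check of the full-attention multiply-add counts.
d = 64  # assumed head dimension

def full_attn_macs(n):
    # Q·Kᵀ costs n·n·d multiply-adds; α·V costs another n·n·d.
    return 2 * n * n * d

for n in [128, 1024, 4096, 32768]:
    print(f"n = {n:,}: {full_attn_macs(n):,} multiply-adds")
# n = 128 gives 2,097,152 (~2M); n = 32,768 gives 137,438,953,472 (~130B)
```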
With KV Cache — O(n · d) attention per step
q(1×d) · Kᵀ(d×n) = s(1×n)
α(1×n) · V(n×d) = o(1×d)
n = 128 tokens  →  ~31K multiply-adds
n = 1,024       →  ~250K multiply-adds
n = 4,096       →  ~1M multiply-adds
n = 32,768      →  ~8M multiply-adds
The score is a 1×n vector, so reading the cached K and V from memory dominates over the arithmetic. The bottleneck shifts from compute to memory bandwidth, which is why decode is memory-bound rather than compute-bound.
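The memory-bound claim can be made concrete with the arithmetic intensity of one decode step: multiply-adds performed per byte read from the KV cache. The fp16 storage (2 bytes/value) and d = 64 below are illustrative assumptions:

```python
# Arithmetic intensity of a single cached decode step.
d, bytes_per_val = 64, 2  # assumed head dim and fp16 storage

def decode_step_stats(n):
    macs = 2 * n * d                        # q·Kᵀ plus α·V
    bytes_read = 2 * n * d * bytes_per_val  # read the K cache and the V cache
    return macs, bytes_read, macs / bytes_read

for n in [128, 4096, 32768]:
    macs, nbytes, intensity = decode_step_stats(n)
    print(f"n = {n:,}: {macs:,} MACs, {nbytes:,} bytes read, {intensity} MACs/byte")
# Intensity is 0.5 MACs/byte at every n; modern accelerators can sustain
# on the order of 100 MACs per byte of memory bandwidth, so each decode
# step mostly waits on memory, not arithmetic.
```

The intensity is constant in n: doubling the context doubles both the reads and the MACs, so no context length makes cached decode compute-bound.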