Without KV Cache — O(n² · d) attention
Q(n×d) · Kᵀ(d×n) = S(n×n)
α = softmax(S/√d), then α(n×n) · V(n×d) = O(n×d)
The two matmuls cost 2·n²·d multiply-adds (d = 64 here):
n = 128     →  ~2M multiply-adds
n = 1,024   →  ~128M multiply-adds
n = 4,096   →  ~2B multiply-adds
n = 32,768  →  ~130B multiply-adds
The score matrix is n×n, so it grows quadratically with context length. At long context it dominates total compute and, if materialized, memory as well (FlashAttention avoids materializing it, which is a separate optimization).
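A minimal NumPy sketch of the uncached case, using d = 64 to match the counts above (the function name and the causal mask are illustrative, not from the source):

```python
import numpy as np

def attention_no_cache(Q, K, V):
    """Full self-attention: materializes the n x n score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                 # (n, n) scores -- quadratic in n
    # causal mask: token i may only attend to tokens 0..i
    S[np.triu(np.ones(S.shape, dtype=bool), k=1)] = -np.inf
    alpha = np.exp(S - S.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha @ V                          # (n, d)

n, d = 128, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = attention_no_cache(Q, K, V)
print(out.shape)          # (128, 64)
print(2 * n**2 * d)       # 2097152 multiply-adds, i.e. ~2M at n = 128
```

Doubling n quadruples the score matrix and the multiply-add count, which is the quadratic blow-up tabulated above.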
With KV Cache — O(n · d) attention per step
q(1×d) · Kᵀ(d×n) = s(1×n)
α = softmax(s/√d), then α(1×n) · V(n×d) = o(1×d)
Each decode step costs 2·n·d multiply-adds (d = 64, matching the uncached counts above):
n = 128     →  ~16K multiply-adds
n = 1,024   →  ~128K multiply-adds
n = 4,096   →  ~512K multiply-adds
n = 32,768  →  ~4M multiply-adds
The score is just a 1×n vector, so reading the cached K and V (2·n·d values per step) costs more time than the arithmetic on them: the bottleneck shifts from compute to memory bandwidth. This is why decode is memory-bound, not compute-bound.
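The cached case can be sketched the same way: one decode step attends a single new query over the cached keys and values, then appends the new token's K/V to the cache (again d = 64; names are illustrative):

```python
import numpy as np

def decode_step(q, K_cache, V_cache):
    """One decode step: query q (1, d) attends over cached K/V (n, d)."""
    d = q.shape[-1]
    s = q @ K_cache.T / np.sqrt(d)   # (1, n) scores -- linear in n
    alpha = np.exp(s - s.max())
    alpha /= alpha.sum()
    return alpha @ V_cache           # (1, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
K_cache = rng.standard_normal((n, d))   # keys for the n previous tokens
V_cache = rng.standard_normal((n, d))   # values for the n previous tokens
q = rng.standard_normal((1, d))         # query for the new token only

o = decode_step(q, K_cache, V_cache)
print(o.shape)      # (1, 64)
print(2 * n * d)    # 131072 multiply-adds, i.e. ~128K at n = 1,024

# after each step the new token's K/V rows are appended, growing n by 1:
K_cache = np.vstack([K_cache, rng.standard_normal((1, d))])
print(K_cache.shape)   # (1025, 64)
```

Note that the step reads all 2·n·d cached values but performs only ~2·n·d multiply-adds on them, roughly one multiply-add per value read, which is far below the compute-to-bandwidth ratio of modern accelerators.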