
Distributed Execution — Tensor Parallelism

Step 1 / 11
Coordinator
◆ Master Worker
Schedules requests and manages the KV cache block table. Broadcasts block mappings to every GPU worker before each forward pass.
Active
R1 · 4 tokens (T1–T4)
Block Table — logical → physical per GPU
Req | LB | GPU 0 | GPU 1 | GPU 2
GPU 0 Heads 0–1
IDLE
Free Pool
KV Cache (heads 0–1 only)
GPU 1 Heads 2–3
IDLE
Free Pool
KV Cache (heads 2–3 only)
GPU 2 Heads 4–5
IDLE
Free Pool
KV Cache (heads 4–5 only)
All-Reduce
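
A minimal sketch of how the head split and the all-reduce fit together, assuming 6 attention heads divided evenly across 3 GPUs as in the panels above. The helper names `heads_for` and `all_reduce_sum` are hypothetical, and the per-head contribution is a toy vector rather than a real attention output. In tensor-parallel attention, each GPU typically computes a partial result from its own heads, and an all-reduce sums those partials so that every GPU ends up holding the same full result.

```python
NUM_GPUS = 3
NUM_HEADS = 6


def heads_for(gpu_id, num_heads=NUM_HEADS, num_gpus=NUM_GPUS):
    """Return the contiguous range of head indices owned by one GPU."""
    per_gpu = num_heads // num_gpus
    start = gpu_id * per_gpu
    return list(range(start, start + per_gpu))


def all_reduce_sum(partials):
    """Simulated all-reduce: every rank receives the element-wise sum."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]


def toy_head_output(head):
    # Toy stand-in for one head's contribution to the output projection.
    return [head, 1]


partials = []
for gpu in range(NUM_GPUS):
    contribs = [toy_head_output(h) for h in heads_for(gpu)]
    # Each GPU sums only the heads whose KV cache it stores.
    partials.append([sum(vals) for vals in zip(*contribs)])

results = all_reduce_sum(partials)
for gpu, out in enumerate(results):
    print(f"GPU {gpu} heads {heads_for(gpu)} -> {out}")
```

Because each GPU only ever reads the keys and values for its own heads, its KV cache holds just that slice; the all-reduce is the one point in this sketch where the GPUs exchange data.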