Distributed Execution — Tensor Parallelism

Coordinator

◆ Master Worker

Schedules requests and manages the KV cache block table. Broadcasts block mappings to every GPU worker before each forward pass.
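The coordinator's role described above can be sketched in a few lines. This is a minimal illustration, not the system's actual implementation: the names `GpuWorker`, `Coordinator`, `schedule`, `broadcast`, and the `block_size` parameter are all hypothetical. It also assumes each GPU keeps its own free pool of physical blocks, as the per-GPU columns of the block table suggest.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GpuWorker:
    """Hypothetical worker: a free pool of physical block ids plus a local table copy."""
    gpu_id: int
    free_pool: list[int]
    block_table: dict[str, list[int]] = field(default_factory=dict)

    def receive(self, mappings: dict[str, list[int]]) -> None:
        # Overwrite the local view with whatever the coordinator broadcast.
        self.block_table = {req: list(blocks) for req, blocks in mappings.items()}


class Coordinator:
    """Hypothetical master worker: schedules requests and owns the block table."""

    def __init__(self, workers: list[GpuWorker], block_size: int = 2) -> None:
        self.workers = workers
        self.block_size = block_size  # tokens per KV block (assumed value)
        # table[req_id][gpu_id] -> physical block ids, indexed by logical block
        self.table: dict[str, dict[int, list[int]]] = {}

    def schedule(self, req_id: str, num_tokens: int) -> None:
        num_blocks = -(-num_tokens // self.block_size)  # ceiling division
        # Check every pool first so a failure never leaves a partial allocation.
        if any(len(w.free_pool) < num_blocks for w in self.workers):
            raise RuntimeError("not enough free KV blocks on some GPU")
        self.table[req_id] = {
            w.gpu_id: [w.free_pool.pop(0) for _ in range(num_blocks)]
            for w in self.workers
        }

    def broadcast(self) -> None:
        # Sent before each forward pass: every worker gets its own column.
        for w in self.workers:
            w.receive({req: per_gpu[w.gpu_id] for req, per_gpu in self.table.items()})


# Request R1 with 4 tokens on 3 GPUs, 4 free blocks per GPU.
workers = [GpuWorker(gpu_id=g, free_pool=[0, 1, 2, 3]) for g in range(3)]
coordinator = Coordinator(workers, block_size=2)
coordinator.schedule("R1", num_tokens=4)
coordinator.broadcast()
```

With a block size of 2, the four tokens of R1 take two logical blocks, and each GPU maps them to two physical blocks drawn from its own free pool.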

Active: R1 · 4 tokens (T1–T4)

Block Table (logical → physical, per GPU)

Req | GPU 0 | GPU 1 | GPU 2

GPU 0 (heads 0–1): IDLE · Free Pool · KV Cache (heads 0–1 only)

GPU 1 (heads 2–3): IDLE · Free Pool · KV Cache (heads 2–3 only)

GPU 2 (heads 4–5): IDLE · Free Pool · KV Cache (heads 4–5 only)
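The head assignment shown in the GPU panels (six attention heads split evenly over three GPUs) can be written down as a small partitioning routine. This is a sketch under the assumption of an even, contiguous split; `head_ranges` and `shard_kv` are hypothetical helper names, and the string values stand in for real key/value tensors.

```python
def head_ranges(num_heads: int, num_gpus: int) -> list[range]:
    """Contiguous, equal-sized head ranges, one per GPU."""
    if num_heads % num_gpus != 0:
        raise ValueError("num_heads must be divisible by num_gpus")
    per_gpu = num_heads // num_gpus
    return [range(g * per_gpu, (g + 1) * per_gpu) for g in range(num_gpus)]


def shard_kv(kv_by_head: list, num_gpus: int) -> list[dict]:
    """Give each GPU only the KV entries of the heads it owns."""
    ranges = head_ranges(len(kv_by_head), num_gpus)
    return [{h: kv_by_head[h] for h in r} for r in ranges]


# Placeholder KV entries for heads 0..5 (real entries would be tensors).
kv_by_head = [f"kv_head_{h}" for h in range(6)]
shards = shard_kv(kv_by_head, num_gpus=3)

for gpu_id, shard in enumerate(shards):
    print(f"GPU {gpu_id} caches heads {sorted(shard)}")
```

Because each GPU stores only its own heads, the per-GPU KV cache for a given token is one third of the full cache in this three-GPU layout.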

All-Reduce
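The diagram only labels this step, so the following is an illustration of what an all-reduce typically does in tensor parallelism, not a description of this system's internals. It assumes the common layout in which the output projection is split by rows to match the head split: each GPU multiplies its own heads' attention output by its slice of the weight matrix, producing a partial result, and the all-reduce sums the partials so every GPU ends up with the full output. The helpers `matvec` and `all_reduce_sum` are hypothetical, and a head dimension of 1 keeps the numbers small.

```python
def matvec(x: list[int], w: list[list[int]]) -> list[int]:
    """Multiply a length-n vector by an n x m matrix."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def all_reduce_sum(partials: list[list[int]]) -> list[list[int]]:
    """Simulated all-reduce: every participant receives the elementwise sum."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]


# One token's attention output: 6 heads, head dimension 1 (assumed for brevity).
x = [1, 2, 3, 4, 5, 6]
# Output projection, 6 x 4, with integer weights so results compare exactly.
w_o = [[(i + 1) * (j + 1) for j in range(4)] for i in range(6)]

# Reference result computed on a single device.
full = matvec(x, w_o)

# Each GPU uses only its two heads and the matching two rows of w_o.
partials = [matvec(x[2 * g:2 * g + 2], w_o[2 * g:2 * g + 2]) for g in range(3)]

# After the all-reduce, every GPU holds the same, complete output.
reduced = all_reduce_sum(partials)
assert all(r == full for r in reduced)
```

In a real deployment this sum would be carried out by a collective communication library across devices; the list-based version here only demonstrates why the summed partials equal the single-device result.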