From Explorations of RDMA in LLM Systems:
Quote
collectives require a fixed “world” of participants. Nodes can’t be added or removed. This is a production nightmare. In disaggregated inference, Prefillers and Decoders need to exchange KvCache. But real production traffic fluctuates — replica count must scale up and down. Machines also fail.
Quote
initializing the collective world is blocking and requires every participant to join. So every time you scale up or down, the whole world must pause.
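Both of the last two quotes are easy to see in torch.distributed, used here as a stand-in for any NCCL-style collective library (the environment-variable rendezvous is an assumption of this sketch, not the article's stack): init_process_group blocks until every one of the world_size ranks has joined, and there is no API to add or remove a rank afterward.

```python
import os
import torch.distributed as dist

# Minimal sketch of the fixed-world pattern. init_process_group blocks
# until ALL world_size ranks have called it; a rank that never arrives
# (or has died) stalls everyone at this rendezvous.
dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),  # fixed at init; no join/leave
)

# There is no "add rank" or "remove rank" call. To change the replica
# count you must tear the world down and repeat the blocking init with a
# new world_size, pausing every participant in between.
dist.destroy_process_group()
```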
Quote
collectives guarantee global ordering semantics. This simplifies application logic, but networks inherently deliver messages out of order, so the library may need extra buffering or synchronization to preserve ordering. The funny part is that some applications don't even want this guarantee. KvCache transfer, for example: we only care that all pages eventually arrive; the order doesn't matter.
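A sketch of why KvCache transfer can ignore ordering (the page-set bookkeeping below is illustrative, not the article's code): completion just means "all expected pages have arrived", which a set or counter captures regardless of delivery order.

```python
# Illustrative bookkeeping for order-insensitive KvCache transfer:
# the transfer is done once every expected page has landed, no matter
# what order the network delivered them in.
expected_pages = {0, 1, 2, 3}          # hypothetical page IDs in flight
received_pages: set[int] = set()

def on_page_received(page_id: int) -> bool:
    """Completion-path callback: returns True once all pages are in."""
    received_pages.add(page_id)
    return received_pages == expected_pages

for pid in (2, 0, 3, 1):               # an out-of-order arrival pattern
    done = on_page_received(pid)
assert done                            # completes exactly as in-order would
```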
Quote
collectives require all participants to share the same tensor shape and dtype. This can hurt code ergonomics and sometimes kills performance. E.g., using collectives for RPC forces you to always send the maximum possible message size.
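A hedged sketch of that ergonomic cost, again using torch.distributed (MAX_MSG_BYTES and the length-prefix framing are invented for illustration): because every rank must post the same shape and dtype, a variable-length RPC has to be padded to the worst case on every call.

```python
import torch
import torch.distributed as dist

MAX_MSG_BYTES = 4096  # all ranks must agree on one shape/dtype up front

def rpc_broadcast(payload: bytes, src: int) -> bytes:
    # Collectives cannot carry variable-length messages, so even a
    # 10-byte RPC moves a full MAX_MSG_BYTES buffer across the network.
    buf = torch.zeros(MAX_MSG_BYTES, dtype=torch.uint8)
    if dist.get_rank() == src:
        assert len(payload) <= MAX_MSG_BYTES - 4
        framed = len(payload).to_bytes(4, "little") + payload
        buf[: len(framed)] = torch.frombuffer(bytearray(framed), dtype=torch.uint8)
    dist.broadcast(buf, src=src)       # every rank posts the same shape
    n = int.from_bytes(bytes(buf[:4].tolist()), "little")
    return bytes(buf[4 : 4 + n].tolist())
```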
Quote
Most RDMA code uses the RC (Reliable Connection) protocol, which has in-order delivery. EFA, however, uses SRD (Scalable Reliable Datagram) — reliable but unordered.
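One generic pattern for living with reliable-but-unordered delivery (nothing here is EFA-specific, and the class and field names are invented): tag each chunk with its destination offset so it can be placed directly on arrival, and track only how many chunks remain.

```python
import threading

class UnorderedReassembler:
    """Place (offset, payload) chunks as they arrive; order is irrelevant.

    This works on a reliable-but-unordered transport like SRD because
    reliability guarantees each chunk arrives exactly once, so a simple
    countdown suffices for completion detection.
    """

    def __init__(self, total_bytes: int, chunk_count: int):
        self.buf = bytearray(total_bytes)
        self.remaining = chunk_count
        self.lock = threading.Lock()   # completions may fire concurrently

    def deliver(self, offset: int, payload: bytes) -> bool:
        with self.lock:
            self.buf[offset:offset + len(payload)] = payload
            self.remaining -= 1
            return self.remaining == 0  # True once every chunk has landed
```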
Quote
IBGDA lets GPUs directly initiate NIC operations, but only ConnectX supports it. Without it, you need CPU mediation to initiate “GPU-side” RDMA.
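A rough sketch of that CPU-mediation (proxy) pattern, with invented names (post_rdma_write and the descriptor fields are hypothetical): the GPU appends a work descriptor to a host-visible ring, and a dedicated CPU thread polls that ring and posts the actual RDMA operation on the GPU's behalf.

```python
import threading
import queue

# Hypothetical CPU proxy for "GPU-initiated" RDMA without IBGDA.
# The GPU cannot ring the NIC doorbell itself, so it enqueues a small
# descriptor into host-visible memory and a CPU thread does the posting.
work_ring: "queue.Queue[dict]" = queue.Queue()

def post_rdma_write(desc: dict) -> None:
    # Stand-in for the real verbs call (posting a write work request);
    # invented for illustration.
    print(f"posting RDMA write: {desc}")

def cpu_proxy_loop() -> None:
    while True:
        desc = work_ring.get()        # poll descriptors produced by the GPU
        if desc is None:              # shutdown sentinel
            break
        post_rdma_write(desc)         # the CPU initiates the NIC operation

proxy = threading.Thread(target=cpu_proxy_loop, daemon=True)
proxy.start()

# A GPU kernel would append this descriptor; simulated from the host here.
work_ring.put({"laddr": 0x1000, "raddr": 0x2000, "len": 4096})
work_ring.put(None)
proxy.join()
```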
Quote
CPU↔GPU PCIe latency is only ~2 μs.
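To see why ~2 μs counts as "only", a back-of-envelope comparison helps. The link speed and page size below are assumptions chosen for illustration, not numbers from the article:

```python
# Illustrative: how much does a ~2 us CPU proxy hop cost relative to the
# wire time of one KvCache page? (Link speed and page size are assumed.)
link_gbps = 400                                 # assumed NIC speed
page_bytes = 1 << 20                            # assumed 1 MiB page
wire_us = page_bytes * 8 / (link_gbps * 1e3)    # bits / (bits per us)
proxy_us = 2.0
print(f"wire: {wire_us:.1f} us, proxy overhead: {proxy_us / wire_us:.0%}")
# wire: 21.0 us, proxy overhead: 10%
```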
Quote
SRD is datagram-based: you can send messages directly as long as you know the address. RC requires connection setup.
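The same distinction exists in ordinary sockets, which makes for a runnable analogy (plain UDP and TCP here, not RDMA code): a datagram socket can send to any address immediately, while a connected protocol must complete a per-peer handshake first.

```python
import socket

# Datagram style (like SRD): no handshake; send to any known address now.
dgram = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
dgram.sendto(b"kv page", ("127.0.0.1", 9999))   # no prior setup required

# Connection style (like RC): per-peer setup must complete before sending.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 9999))
listener.listen()                                # peer must be ready...

stream = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
stream.connect(("127.0.0.1", 9999))              # ...and a handshake done
stream.sendall(b"kv page")
```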
Quote
we were still relying entirely on TensorRT-LLM. Then we built our own inference engine, routing layer, kernels, networking stack…