From Explorations of RDMA in LLM Systems:
Quote
collectives require a fixed “world” of participants. Nodes can’t be added or removed. This is a production nightmare. In disaggregated inference, Prefillers and Decoders need to exchange KvCache. But real production traffic fluctuates — replica count must scale up and down. Machines also fail.
Quote
initializing the collective world is blocking and requires every participant to join. So every time you scale up or down, the whole world must pause.
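Both of the last two quotes are easy to see in torch.distributed, used here as a stand-in for any NCCL-style collective library (the environment-variable rendezvous is an assumption of this sketch, not the article's stack): init_process_group blocks until every one of the world_size ranks has joined, and there is no API to add or remove a rank afterward.

```python
import os
import torch.distributed as dist

# Minimal sketch of the fixed-world pattern. init_process_group blocks
# until ALL world_size ranks have called it; a rank that never arrives
# (or has died) stalls everyone at this rendezvous.
dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),  # fixed at init; no join/leave
)

# There is no "add rank" or "remove rank" call. To change the replica
# count you must tear the world down and repeat the blocking init with a
# new world_size, pausing every participant in between.
dist.destroy_process_group()
```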
Quote
collectives guarantee global ordering semantics. This simplifies application logic, but networks inherently deliver messages out of order, so the library may need extra buffering or synchronization to preserve ordering. The funny part is that some applications don't even want this guarantee. KvCache transfer, for example: we only care that all pages eventually arrive; the order doesn't matter.
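A sketch of why KvCache transfer can ignore ordering (the page-set bookkeeping below is illustrative, not the article's code): completion just means "all expected pages have arrived", which a set or counter captures regardless of delivery order.

```python
# Illustrative bookkeeping for order-insensitive KvCache transfer:
# the transfer is done once every expected page has landed, no matter
# what order the network delivered them in.
expected_pages = {0, 1, 2, 3}          # hypothetical page IDs in flight
received_pages: set[int] = set()

def on_page_received(page_id: int) -> bool:
    """Completion-path callback: returns True once all pages are in."""
    received_pages.add(page_id)
    return received_pages == expected_pages

for pid in (2, 0, 3, 1):               # an out-of-order arrival pattern
    done = on_page_received(pid)
assert done                            # completes exactly as in-order would
```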
Quote
collectives require all participants to share the same tensor shape and dtype. This can hurt code ergonomics and sometimes kills performance. E.g., using collectives for RPC forces you to always send the maximum possible message size.
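A hedged sketch of that ergonomic cost, again using torch.distributed (MAX_MSG_BYTES and the length-prefix framing are invented for illustration): because every rank must post the same shape and dtype, a variable-length RPC has to be padded to the worst case on every call.

```python
import torch
import torch.distributed as dist

MAX_MSG_BYTES = 4096  # all ranks must agree on one shape/dtype up front

def rpc_broadcast(payload: bytes, src: int) -> bytes:
    # Collectives cannot carry variable-length messages, so even a
    # 10-byte RPC moves a full MAX_MSG_BYTES buffer across the network.
    buf = torch.zeros(MAX_MSG_BYTES, dtype=torch.uint8)
    if dist.get_rank() == src:
        assert len(payload) <= MAX_MSG_BYTES - 4
        framed = len(payload).to_bytes(4, "little") + payload
        buf[: len(framed)] = torch.frombuffer(bytearray(framed), dtype=torch.uint8)
    dist.broadcast(buf, src=src)       # every rank posts the same shape
    n = int.from_bytes(bytes(buf[:4].tolist()), "little")
    return bytes(buf[4 : 4 + n].tolist())
```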
Quote
Most RDMA code uses the RC (Reliable Connection) protocol, which has in-order delivery. EFA, however, uses SRD (Scalable Reliable Datagram) — reliable but unordered.
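One generic pattern for living with reliable-but-unordered delivery (nothing here is EFA-specific, and the class and field names are invented): tag each chunk with its destination offset so it can be placed directly on arrival, and track only how many chunks remain.

```python
import threading

class UnorderedReassembler:
    """Place (offset, payload) chunks as they arrive; order is irrelevant.

    This works on a reliable-but-unordered transport like SRD because
    reliability guarantees each chunk arrives exactly once, so a simple
    countdown suffices for completion detection.
    """

    def __init__(self, total_bytes: int, chunk_count: int):
        self.buf = bytearray(total_bytes)
        self.remaining = chunk_count
        self.lock = threading.Lock()   # completions may fire concurrently

    def deliver(self, offset: int, payload: bytes) -> bool:
        with self.lock:
            self.buf[offset:offset + len(payload)] = payload
            self.remaining -= 1
            return self.remaining == 0  # True once every chunk has landed
```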
Quote
IBGDA lets GPUs directly initiate NIC operations, but only ConnectX supports it. Without it, you need CPU mediation to initiate “GPU-side” RDMA.
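A rough sketch of that CPU-mediation (proxy) pattern, with invented names (post_rdma_write and the descriptor fields are hypothetical): the GPU appends a work descriptor to a host-visible ring, and a dedicated CPU thread polls that ring and posts the actual RDMA operation on the GPU's behalf.

```python
import threading
import queue

# Hypothetical CPU proxy for "GPU-initiated" RDMA without IBGDA.
# The GPU cannot ring the NIC doorbell itself, so it enqueues a small
# descriptor into host-visible memory and a CPU thread does the posting.
work_ring: "queue.Queue[dict]" = queue.Queue()

def post_rdma_write(desc: dict) -> None:
    # Stand-in for the real verbs call (posting a write work request);
    # invented for illustration.
    print(f"posting RDMA write: {desc}")

def cpu_proxy_loop() -> None:
    while True:
        desc = work_ring.get()        # poll descriptors produced by the GPU
        if desc is None:              # shutdown sentinel
            break
        post_rdma_write(desc)         # the CPU initiates the NIC operation

proxy = threading.Thread(target=cpu_proxy_loop, daemon=True)
proxy.start()

# A GPU kernel would append this descriptor; simulated from the host here.
work_ring.put({"laddr": 0x1000, "raddr": 0x2000, "len": 4096})
work_ring.put(None)
proxy.join()
```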
Quote
CPU↔GPU PCIe latency is only ~2 μs.
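To see why ~2 μs counts as "only", a back-of-envelope comparison helps. The link speed and page size below are assumptions chosen for illustration, not numbers from the article:

```python
# Illustrative: how much does a ~2 us CPU proxy hop cost relative to the
# wire time of one KvCache page? (Link speed and page size are assumed.)
link_gbps = 400                                 # assumed NIC speed
page_bytes = 1 << 20                            # assumed 1 MiB page
wire_us = page_bytes * 8 / (link_gbps * 1e3)    # bits / (bits per us)
proxy_us = 2.0
print(f"wire: {wire_us:.1f} us, proxy overhead: {proxy_us / wire_us:.0%}")
# wire: 21.0 us, proxy overhead: 10%
```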
Quote
SRD is datagram-based: you can send messages directly as long as you know the address. RC requires connection setup.
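The same distinction exists in ordinary sockets, which makes for a runnable analogy (plain UDP and TCP here, not RDMA code): a datagram socket can send to any address immediately, while a connected protocol must complete a per-peer handshake first.

```python
import socket

# Datagram style (like SRD): no handshake; send to any known address now.
dgram = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
dgram.sendto(b"kv page", ("127.0.0.1", 9999))   # no prior setup required

# Connection style (like RC): per-peer setup must complete before sending.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 9999))
listener.listen()                                # peer must be ready...

stream = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
stream.connect(("127.0.0.1", 9999))              # ...and a handshake done
stream.sendall(b"kv page")
```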
Quote
we were still relying entirely on TensorRT-LLM. Then we built our own inference engine, routing layer, kernels, networking stack…