[PD] Support KV transfer with MORI-IO #14626
Draft
+1,145
−1
Motivation
MORI-IO is AMD's high-performance point-to-point communication library. It uses GDR (GPU Direct RDMA) to achieve ultra-low latency and high bandwidth for KV cache transfer in LLM inference. To enable efficient PD (Prefill-Decode) disaggregation on AMD hardware, we adopt the MORI-IO transfer engine as the transport layer for SGLang.
Modifications
Architecture Overview
The implementation follows a pattern similar to the Mooncake transfer engine integration, with MORI-IO-specific optimizations:
1. MoriKVManager - Core Transfer Management
Initialization:
IOEngine with RDMA backend configuration
Configuration Options (via environment variables):
SGLANG_MORI_QP_PER_TRANSFER: Number of queue pairs per transfer (default: 1)
SGLANG_MORI_POST_BATCH_SIZE: RDMA post batch size (default: -1)
SGLANG_MORI_NUM_WORKERS: Number of worker threads (default: 1)
2. MoriKVSender (Prefill Side)
batch_write API for KV cache transfer
3. MoriKVReceiver (Decode Side)
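For reviewers, the environment-variable configuration listed under MoriKVManager can be sketched as below. This is a minimal illustration, not the code in this PR: the variable names and defaults come from the list above, while the `MoriTransferConfig` class and its `from_env` helper are hypothetical names introduced only for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MoriTransferConfig:
    """Hypothetical container for the three MORI-IO tuning knobs."""

    qp_per_transfer: int = 1   # SGLANG_MORI_QP_PER_TRANSFER
    post_batch_size: int = -1  # SGLANG_MORI_POST_BATCH_SIZE
    num_workers: int = 1       # SGLANG_MORI_NUM_WORKERS

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "MoriTransferConfig":
        # Fall back to the process environment when no mapping is passed in.
        env = os.environ if env is None else env
        return cls(
            qp_per_transfer=int(env.get("SGLANG_MORI_QP_PER_TRANSFER", "1")),
            post_batch_size=int(env.get("SGLANG_MORI_POST_BATCH_SIZE", "-1")),
            num_workers=int(env.get("SGLANG_MORI_NUM_WORKERS", "1")),
        )


if __name__ == "__main__":
    # Prints the documented defaults unless the variables are set.
    print(MoriTransferConfig.from_env())
```

Passing an explicit mapping to `from_env` keeps the parsing testable without mutating the process environment; how the PR itself reads these variables may differ.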
Usage
Installation: Install MORI-IO library following the MORI installation guide:
SGLang PD Disaggregation with MORI-IO: Use --disaggregation-transfer-backend mori to enable MORI-IO transfer engine:
Known Limitations
State data transfer not implemented: The MORI-IO implementation does not currently support state data transfer for hybrid models (Mamba, SWA, NSA).
Benchmarking and Profiling
End-to-End PD Disaggregation
Hardware Configuration:
Benchmark Command:
Prefill instance (node 1):
Decode instance (node 2):
Router (node 3):
Benchmark client:
Performance Results:
Comparison of STANDALONE (no PD disaggregation) vs MORI vs MOONCAKE backends. Each test was run 3 times and averaged.
Both MORI and MOONCAKE leverage RDMA effectively, with near-identical performance profiles, validating MORI-IO as a production-ready alternative for AMD hardware.
Checklist