[PD] Support KV transfer with MORI-IO #14626

maning00 · 2025-12-08T07:47:47Z

Motivation

MORI-IO is AMD's high-performance, point-to-point communication library that leverages GDR (GPU Direct RDMA) to achieve ultra-low latency and high bandwidth for KVCache transfer in LLM inference. To enable efficient PD (Prefill-Decode) disaggregation on AMD hardware, we adopt the MORI-IO transfer engine as the transport layer for SGLang.

Modifications

Architecture Overview

The implementation follows a similar pattern to the mooncake transfer engine integration, with MORI-IO-specific optimizations:

1. MoriKVManager - Core Transfer Management

Initialization:
- Creates IOEngine with RDMA backend configuration
- Registers all GPU memory buffers with the engine
- Spawns dedicated threads for bootstrap or status polling

Configuration Options (via environment variables):
- SGLANG_MORI_QP_PER_TRANSFER: Number of queue pairs per transfer (default: 1)
- SGLANG_MORI_POST_BATCH_SIZE: RDMA post batch size (default: -1)
- SGLANG_MORI_NUM_WORKERS: Number of worker threads (default: 1)
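
As a rough illustration of how these options could be consumed, the sketch below reads the three variables and falls back to the defaults listed above. The variable names and defaults come from this description; the helper `read_mori_config` and its parsing rules are assumptions made for the example, not the code in this PR.

```python
import os

# Defaults copied from the option list above; everything else is illustrative.
_MORI_DEFAULTS = {
    "SGLANG_MORI_QP_PER_TRANSFER": 1,
    "SGLANG_MORI_POST_BATCH_SIZE": -1,
    "SGLANG_MORI_NUM_WORKERS": 1,
}


def read_mori_config(env=None):
    """Return the three MORI-IO settings as ints (hypothetical helper).

    `env` is any mapping of variable names to strings; it defaults to
    os.environ. Unset or blank variables fall back to the listed defaults.
    """
    env = os.environ if env is None else env
    config = {}
    for name, default in _MORI_DEFAULTS.items():
        raw = env.get(name)
        if raw is None or raw.strip() == "":
            config[name] = default
        else:
            config[name] = int(raw)  # raises ValueError on non-integer input
    return config
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in isolation, for example `read_mori_config({"SGLANG_MORI_NUM_WORKERS": "4"})`.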

2. MoriKVSender (Prefill Side)

Workflow:
1. Waits for decode instance registration via bootstrap thread
2. Receives transfer metadata (destination memory descriptors, indices) from decode
3. Issues RDMA writes using batch_write API for KV cache transfer
4. Sends auxiliary data via TCP (ZMQ)
5. Monitors transfer status and notifies decode instance upon completion
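
One way to picture step 3 is to coalesce contiguous KV pages into larger byte ranges before handing them to a batched write, which is the general idea behind merging buffers. The sketch below is an assumption-laden illustration: `merge_contiguous_pages` and its arguments are hypothetical names, and it deliberately stops short of calling any MORI-IO API.

```python
from typing import List, Tuple


def merge_contiguous_pages(
    src_pages: List[int], dst_pages: List[int], page_bytes: int
) -> List[Tuple[int, int, int]]:
    """Coalesce page pairs into (src_offset, dst_offset, length) byte ranges.

    Hypothetical helper: a run is extended only when the next page is
    contiguous on BOTH the source and the destination side, so each returned
    range can be moved with a single write.
    """
    if len(src_pages) != len(dst_pages):
        raise ValueError("source and destination page lists must match in length")
    ranges: List[Tuple[int, int, int]] = []
    for src, dst in zip(src_pages, dst_pages):
        src_off, dst_off = src * page_bytes, dst * page_bytes
        if ranges:
            prev_src, prev_dst, length = ranges[-1]
            if prev_src + length == src_off and prev_dst + length == dst_off:
                ranges[-1] = (prev_src, prev_dst, length + page_bytes)
                continue
        ranges.append((src_off, dst_off, page_bytes))
    return ranges
```

Fewer, larger ranges mean fewer work requests per layer, which is usually what a batched RDMA write benefits from.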

3. MoriKVReceiver (Decode Side)

Workflow:
1. Registers local engine descriptor and memory descriptors with prefill instance
2. Polls for transfer completion status
3. Receives auxiliary data via dedicated TCP handler
4. Updates request status based on prefill notifications
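
The status handling in steps 2 and 4 can be sketched as a small thread-safe table that a notification handler writes and a poller reads. All names here (`TransferStatus`, `StatusTable`, the `room` key) are hypothetical and chosen for the example; the actual status values and bookkeeping in this PR may differ.

```python
import enum
import threading


class TransferStatus(enum.Enum):
    # Illustrative states only; not the enum used by SGLang.
    WAITING = "waiting"
    TRANSFERRING = "transferring"
    SUCCESS = "success"
    FAILED = "failed"


_TERMINAL = {TransferStatus.SUCCESS, TransferStatus.FAILED}


class StatusTable:
    """Per-request status shared between a notification thread and a poller."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def register(self, room):
        with self._lock:
            self._status[room] = TransferStatus.WAITING

    def notify(self, room, status):
        """Apply a notification from the prefill side; terminal states stick."""
        with self._lock:
            if self._status[room] not in _TERMINAL:
                self._status[room] = status

    def poll(self, room):
        with self._lock:
            return self._status[room]
```

Making terminal states sticky means a late or duplicated notification cannot move a finished request back to an in-progress state.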

Usage

Installation: Install the MORI-IO library following the MORI installation guide:

cd mori 
pip install -r requirements-build.txt 
git submodule update --init --recursive 
pip3 install .

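SGLang PD Disaggregation with MORI-IO: Pass --disaggregation-transfer-backend mori to both the prefill and decode servers to enable the MORI-IO transfer engine. The benchmark commands below show complete launch lines.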

Known Limitations

State data transfer not implemented: Currently, the MORI-IO implementation does not support state data transfer for hybrid models (Mamba, SWA, NSA).

Benchmarking and Profiling

End-to-End PD Disaggregation

Hardware Configuration:

GPUs: 8x AMD Instinct MI355X per node
CPUs: 2x AMD EPYC per node
Network: 8x AMD Pensando Pollara 400 AI-NIC per node
Model: DeepSeek-V3 with TP=8
Setup: 3-node configuration (1 prefill + 1 decode + 1 router)

Benchmark Command:

Prefill instance (node 1):

python -m sglang.launch_server \
  --model-path DeepSeek-V3 \
  --disaggregation-mode prefill \
  --host 0.0.0.0 --port 30002 \
  --tp-size 8 \
  --disaggregation-transfer-backend mori \
  --disable-radix-cache \
  --trust-remote-code

Decode instance (node 2):

python -m sglang.launch_server \
  --model-path DeepSeek-V3 \
  --disaggregation-mode decode \
  --host 0.0.0.0 --port 30003 \
  --tp-size 8 \
  --disaggregation-transfer-backend mori \
  --disable-radix-cache \
  --trust-remote-code

Router (node 3):

python -m sglang_router.launch_router \
  --pd-disaggregation --mini-lb \
  --prefill http://node1:30002 \
  --decode http://node2:30003 \
  --host 0.0.0.0 --port 8000

Benchmark client:

python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:8000 \
  --dataset-name random \
  --num-prompts 1 \
  --random-input <1024/2048/4096/8192> \
  --random-output 16
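
The `<1024/2048/4096/8192>` placeholder stands for four separate runs. A small helper like the one below could generate one client command per input length; the flags are exactly those shown above, while `bench_command`, `all_commands`, and `INPUT_LENGTHS` are names introduced only for this sketch.

```python
import shlex

# The four input lengths from the placeholder in the benchmark command.
INPUT_LENGTHS = [1024, 2048, 4096, 8192]


def bench_command(input_len, base_url="http://127.0.0.1:8000"):
    """Build the bench_serving argument list for one input length."""
    return [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--base-url", base_url,
        "--dataset-name", "random",
        "--num-prompts", "1",
        "--random-input", str(input_len),
        "--random-output", "16",
    ]


def all_commands():
    """Return one shell-quoted command line per input length."""
    return [shlex.join(bench_command(n)) for n in INPUT_LENGTHS]
```

Each argument list can be passed to `subprocess.run` directly, or the strings from `all_commands()` can be pasted into a shell.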

Performance Results:

Comparison of STANDALONE (no PD disaggregation) vs MORI vs MOONCAKE backends. Each test was run 3 times and averaged.

Input Tokens	Metric	STANDALONE	MORI	MOONCAKE
1024	Throughput (tok/s)	21.91	21.22	21.24
1024	TTFT Mean (ms)	155.35	168.13	165.71
1024	TTFT P99 (ms)	155.35	168.13	165.71
2048	Throughput (tok/s)	21.49	20.63	20.78
2048	TTFT Mean (ms)	162.24	176.57	173.54
2048	TTFT P99 (ms)	162.24	176.57	173.54
4096	Throughput (tok/s)	21.51	20.68	20.74
4096	TTFT Mean (ms)	159.93	176.05	174.75
4096	TTFT P99 (ms)	159.93	176.05	174.75
8192	Throughput (tok/s)	16.93	16.02	16.10
8192	TTFT Mean (ms)	294.41	326.19	322.70
8192	TTFT P99 (ms)	294.41	326.19	322.70

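Both MORI and MOONCAKE leverage RDMA effectively, with near-identical performance profiles: at every input length, MORI throughput is within 1% of MOONCAKE and MORI TTFT is within 2%. TTFT Mean and P99 coincide because each run sends a single prompt (--num-prompts 1). These results support MORI-IO as an alternative transfer backend for AMD hardware.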
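
The relative gaps between backends can be checked directly from the table. The snippet below copies the throughput and TTFT mean values from the results above and computes the percentage difference of MORI against MOONCAKE; the dictionary layout and function names are choices made for this example.

```python
# (STANDALONE, MORI, MOONCAKE) values copied from the results table.
RESULTS = {
    1024: {"throughput": (21.91, 21.22, 21.24), "ttft_ms": (155.35, 168.13, 165.71)},
    2048: {"throughput": (21.49, 20.63, 20.78), "ttft_ms": (162.24, 176.57, 173.54)},
    4096: {"throughput": (21.51, 20.68, 20.74), "ttft_ms": (159.93, 176.05, 174.75)},
    8192: {"throughput": (16.93, 16.02, 16.10), "ttft_ms": (294.41, 326.19, 322.70)},
}


def pct_diff(value, reference):
    """Signed percentage difference of `value` relative to `reference`."""
    return (value - reference) / reference * 100.0


def mori_vs_mooncake(metric):
    """Map each input length to MORI's percentage difference from MOONCAKE."""
    return {
        tokens: pct_diff(row[metric][1], row[metric][2])
        for tokens, row in RESULTS.items()
    }


if __name__ == "__main__":
    for metric in ("throughput", "ttft_ms"):
        for tokens, diff in mori_vs_mooncake(metric).items():
            print(f"{metric:>10} @ {tokens:>4} tokens: {diff:+.2f}%")
```

A negative throughput difference or a positive TTFT difference means MORI is slightly behind MOONCAKE for that input length.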

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

maning00 and others added 11 commits December 3, 2025 16:52

implement MoriKVManager

5f9d062

add batch api

cc0fc35

format code

9f4c847

remove unused

cc333d1

use user specified nic

f1e11f5

fix engine_key collision

d0f67a5

Disable unused notification

82ed526

Use TCP to send AUX (#2)

c751a9b

- add disable notif
- send aux with tcp
- remove unused log

Co-authored-by: cwortman-amd <[email protected]>

add mix tp support

cb5868a

remove unused cond states && improve offset calc

44c5a61

optim _issue_layer_transfers to utilize buffer merge

b819d55
