@maning00 maning00 commented Dec 8, 2025

Motivation

MORI-IO is AMD's high-performance, point-to-point communication library. It uses GDR (GPU Direct RDMA) to achieve ultra-low latency and high bandwidth for KVCache transfer in LLM inference. To enable efficient PD (Prefill-Decode) disaggregation on AMD hardware, this PR adopts the MORI-IO transfer engine as the transport layer for SGLang.

Modifications

Architecture Overview

The implementation follows the same pattern as the Mooncake transfer engine integration, with MORI-IO-specific optimizations:

1. MoriKVManager - Core Transfer Management

  • Initialization:

    • Creates IOEngine with RDMA backend configuration
    • Registers all GPU memory buffers with the engine
    • Spawns dedicated threads for bootstrap or status polling
  • Configuration Options (via environment variables):

    • SGLANG_MORI_QP_PER_TRANSFER: Number of queue pairs per transfer (default: 1)
    • SGLANG_MORI_POST_BATCH_SIZE: RDMA post batch size (default: -1)
    • SGLANG_MORI_NUM_WORKERS: Number of worker threads (default: 1)
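
The options above could be read along the following lines. This is a minimal sketch, not the PR's code: the `MoriEnvConfig` class and `load_mori_env_config` function are hypothetical names introduced for illustration; only the environment variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MoriEnvConfig:
    """Hypothetical container for the three documented options."""
    qp_per_transfer: int
    post_batch_size: int
    num_workers: int


def load_mori_env_config(env: Optional[Mapping[str, str]] = None) -> MoriEnvConfig:
    """Read the SGLANG_MORI_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return MoriEnvConfig(
        qp_per_transfer=int(env.get("SGLANG_MORI_QP_PER_TRANSFER", "1")),
        post_batch_size=int(env.get("SGLANG_MORI_POST_BATCH_SIZE", "-1")),
        num_workers=int(env.get("SGLANG_MORI_NUM_WORKERS", "1")),
    )


if __name__ == "__main__":
    # With no variables set, this prints the defaults listed above.
    print(load_mori_env_config())
```

Passing a plain dict instead of `os.environ` keeps the parsing easy to test without touching the process environment.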

2. MoriKVSender (Prefill Side)

  • Workflow:
    1. Waits for decode instance registration via bootstrap thread
    2. Receives transfer metadata (destination memory descriptors, indices) from decode
    3. Issues RDMA writes using batch_write API for KV cache transfer
    4. Sends auxiliary data via TCP (ZMQ)
    5. Monitors transfer status and notifies decode instance upon completion
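
The description does not say how KV page indices are turned into individual RDMA writes. One common approach, sketched below purely as an assumption, is to merge runs of consecutive source and destination pages into larger (source offset, destination offset, length) transfers before handing them to a batched write call. The function name `coalesce_page_indices` and the `page_bytes` parameter are hypothetical, and the real `batch_write` signature is not shown here.

```python
from typing import List, Sequence, Tuple


def coalesce_page_indices(
    src_pages: Sequence[int],
    dst_pages: Sequence[int],
    page_bytes: int,
) -> List[Tuple[int, int, int]]:
    """Merge consecutive page pairs into (src_offset, dst_offset, length) triples.

    Hypothetical helper: a run is extended only while BOTH the source and the
    destination indices increase by exactly one.
    """
    if len(src_pages) != len(dst_pages):
        raise ValueError("src_pages and dst_pages must have the same length")

    transfers: List[Tuple[int, int, int]] = []
    i, n = 0, len(src_pages)
    while i < n:
        j = i + 1
        while (
            j < n
            and src_pages[j] == src_pages[j - 1] + 1
            and dst_pages[j] == dst_pages[j - 1] + 1
        ):
            j += 1
        run_len = j - i
        transfers.append(
            (src_pages[i] * page_bytes, dst_pages[i] * page_bytes, run_len * page_bytes)
        )
        i = j
    return transfers


if __name__ == "__main__":
    # Pages 4..6 map to 10..12 (one merged write); page 9 maps to 2 (separate write).
    print(coalesce_page_indices([4, 5, 6, 9], [10, 11, 12, 2], page_bytes=4096))
```

Fewer, larger writes reduce the number of work requests posted per transfer, which is the usual motivation for this kind of coalescing.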

3. MoriKVReceiver (Decode Side)

  • Workflow:
    1. Registers local engine descriptor and memory descriptors with prefill instance
    2. Polls for transfer completion status
    3. Receives auxiliary data via dedicated TCP handler
    4. Updates request status based on prefill notifications
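
The polling step in the receiver workflow can be pictured as a loop that checks a status source until it reaches a terminal state. The sketch below is an illustration under assumptions: `TransferStatus` and `wait_for_transfer` are hypothetical names, and the actual status values used by the PR may differ.

```python
import enum
import time
from typing import Callable


class TransferStatus(enum.Enum):
    """Hypothetical status values; not taken from the PR."""
    WAITING = "waiting"
    TRANSFERRING = "transferring"
    SUCCESS = "success"
    FAILED = "failed"


TERMINAL = {TransferStatus.SUCCESS, TransferStatus.FAILED}


def wait_for_transfer(
    poll: Callable[[], TransferStatus],
    timeout_s: float = 5.0,
    interval_s: float = 0.01,
) -> TransferStatus:
    """Call poll() until it returns a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = poll()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("transfer did not reach a terminal status in time")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Simulated status source standing in for notifications from the prefill side.
    states = iter(
        [TransferStatus.WAITING, TransferStatus.TRANSFERRING, TransferStatus.SUCCESS]
    )
    print(wait_for_transfer(lambda: next(states)))
```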

Usage

Installation: Install the MORI-IO library following the MORI installation guide:

cd mori 
pip install -r requirements-build.txt 
git submodule update --init --recursive 
pip3 install .

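SGLang PD Disaggregation with MORI-IO: Pass --disaggregation-transfer-backend mori to both the prefill and decode instances to enable the MORI-IO transfer engine. Full launch commands are listed under Benchmark Command.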

Known Limitations

State data transfer not implemented: The MORI-IO implementation does not yet support state data transfer for hybrid models (Mamba, SWA, NSA).

Benchmarking and Profiling

End-to-End PD Disaggregation

Hardware Configuration:

  • GPUs: 8x AMD Instinct MI355X per node
  • CPUs: 2x AMD EPYC per node
  • Network: 8x AMD Pensando Pollara 400 AI-NIC per node
  • Model: DeepSeek-V3 with TP=8
  • Setup: 3-node configuration (1 prefill + 1 decode + 1 router)

Benchmark Command:

Prefill instance (node 1):

python -m sglang.launch_server \
  --model-path DeepSeek-V3 \
  --disaggregation-mode prefill \
  --host 0.0.0.0 --port 30002 \
  --tp-size 8 \
  --disaggregation-transfer-backend mori \
  --disable-radix-cache \
  --trust-remote-code

Decode instance (node 2):

python -m sglang.launch_server \
  --model-path DeepSeek-V3 \
  --disaggregation-mode decode \
  --host 0.0.0.0 --port 30003 \
  --tp-size 8 \
  --disaggregation-transfer-backend mori \
  --disable-radix-cache \
  --trust-remote-code

Router (node 3):

python -m sglang_router.launch_router \
  --pd-disaggregation --mini-lb \
  --prefill http://node1:30002 \
  --decode http://node2:30003 \
  --host 0.0.0.0 --port 8000

Benchmark client:

python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:8000 \
  --dataset-name random \
  --num-prompts 1 \
  --random-input <1024/2048/4096/8192> \
  --random-output 16
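
To repeat the sweep over all four input lengths, the client invocation above can be generated programmatically. The sketch below only builds the argument lists, using exactly the flags shown in the command; the helper name `build_bench_commands` is hypothetical, and the count of 3 runs per length follows the averaging described in the results.

```python
from typing import List

INPUT_LENGTHS = [1024, 2048, 4096, 8192]
RUNS_PER_LENGTH = 3  # the reported results average 3 runs per input length


def build_bench_commands(base_url: str = "http://127.0.0.1:8000") -> List[List[str]]:
    """Return one argv list per (input length, run) pair, mirroring the command above."""
    commands: List[List[str]] = []
    for input_len in INPUT_LENGTHS:
        for _ in range(RUNS_PER_LENGTH):
            commands.append([
                "python3", "-m", "sglang.bench_serving",
                "--backend", "sglang",
                "--base-url", base_url,
                "--dataset-name", "random",
                "--num-prompts", "1",
                "--random-input", str(input_len),
                "--random-output", "16",
            ])
    return commands


if __name__ == "__main__":
    for argv in build_bench_commands():
        # Each list could be passed to subprocess.run(argv, check=True).
        print(" ".join(argv))
```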

Performance Results:

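Comparison of STANDALONE (no PD disaggregation), MORI, and MOONCAKE backends. Each test was run 3 times and averaged. Because each run issues a single request (--num-prompts 1), the TTFT mean and P99 values are identical.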

| Input Tokens | Metric | STANDALONE | MORI | MOONCAKE |
|---|---|---|---|---|
| 1024 | Throughput (tok/s) | 21.91 | 21.22 | 21.24 |
| 1024 | TTFT Mean (ms) | 155.35 | 168.13 | 165.71 |
| 1024 | TTFT P99 (ms) | 155.35 | 168.13 | 165.71 |
| 2048 | Throughput (tok/s) | 21.49 | 20.63 | 20.78 |
| 2048 | TTFT Mean (ms) | 162.24 | 176.57 | 173.54 |
| 2048 | TTFT P99 (ms) | 162.24 | 176.57 | 173.54 |
| 4096 | Throughput (tok/s) | 21.51 | 20.68 | 20.74 |
| 4096 | TTFT Mean (ms) | 159.93 | 176.05 | 174.75 |
| 4096 | TTFT P99 (ms) | 159.93 | 176.05 | 174.75 |
| 8192 | Throughput (tok/s) | 16.93 | 16.02 | 16.10 |
| 8192 | TTFT Mean (ms) | 294.41 | 326.19 | 322.70 |
| 8192 | TTFT P99 (ms) | 294.41 | 326.19 | 322.70 |

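Both MORI and MOONCAKE leverage RDMA effectively and show near-identical performance profiles: at every input length, MORI throughput is within 1% of MOONCAKE and MORI TTFT is within 2%. Relative to STANDALONE, MORI adds roughly 8 to 11% to TTFT. These results validate MORI-IO as a production-ready alternative for AMD hardware.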
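
The relative differences can be recomputed directly from the reported TTFT means. The short script below is a convenience sketch: the numbers are copied from the results table, and the helper name `relative_overhead_pct` is introduced here for illustration.

```python
# TTFT mean (ms) per input length, copied from the results table.
TTFT_MS = {
    1024: {"standalone": 155.35, "mori": 168.13, "mooncake": 165.71},
    2048: {"standalone": 162.24, "mori": 176.57, "mooncake": 173.54},
    4096: {"standalone": 159.93, "mori": 176.05, "mooncake": 174.75},
    8192: {"standalone": 294.41, "mori": 326.19, "mooncake": 322.70},
}


def relative_overhead_pct(value: float, baseline: float) -> float:
    """Percentage by which value exceeds baseline."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    for input_len, row in TTFT_MS.items():
        vs_standalone = relative_overhead_pct(row["mori"], row["standalone"])
        vs_mooncake = relative_overhead_pct(row["mori"], row["mooncake"])
        print(
            f"{input_len:>5} tokens: MORI TTFT is {vs_standalone:.1f}% above STANDALONE "
            f"and {vs_mooncake:.1f}% above MOONCAKE"
        )
```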
