CUDA: experimental native mxfp4 support for blackwell [WIP] #17906
base: master
Conversation
Nice speedup.
Master: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
PR: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
ggml/src/ggml-cuda/common.cuh
Outdated
if (sign > 0.0f) {
    return static_cast<uint8_t>(best_i);       // 0..7
} else {
    return static_cast<uint8_t>(best_i | 0x8); // 8..15
}
I think it would be slightly more optimal to extract the sign bit from x with a bit shift and a logical AND.
More generally, there are FP4 conversion intrinsics in the CUDA math API, but I'm not sure whether they would be of use.
ggml/src/ggml-cuda/mmq.cuh
Outdated
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 0] = compress(aux_q4[1]) << 16 | compress(aux_q4[0]);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 1] = compress(aux_q4[3]) << 16 | compress(aux_q4[2]);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 2] = compress(aux_q4[1] >> 4) << 16 | compress(aux_q4[0] >> 4);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 3] = compress(aux_q4[3] >> 4) << 16 | compress(aux_q4[2] >> 4);
At this point in the code you should be suffering from a 4-way shared memory bank conflict.
    return 0;
}

const uint8_t sign_bit = x < 0.0f ? 0x8 : 0;
I don't know if the compiler is smart enough to do this optimization, but I meant to transplant the sign bit directly, without any conditional statements at all: reinterpret the float's bits as an unsigned integer, shift 28 bits to the right, and apply & 0x8.
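A minimal sketch of the branchless variant described here, assuming `best_i` is the 3-bit magnitude index from the surrounding code; the helper name is hypothetical:

```cuda
// Move the IEEE-754 sign bit (bit 31) down to bit 3 of the fp4 code,
// with no branch and no floating-point comparison.
static __device__ __forceinline__ uint8_t fp4_apply_sign(float x, int best_i) {
    const uint32_t bits     = __float_as_uint(x);    // reinterpret the bits, not a value cast
    const uint32_t sign_bit = (bits >> 28) & 0x8;    // bit 31 -> bit 3
    return static_cast<uint8_t>(best_i | sign_bit);  // 0..7 positive, 8..15 negative
}
```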
ggml/src/ggml-cuda/mmq.cuh
Outdated
}

#define MMQ_MMA_TILE_X_K_Q8_0 (2*MMQ_TILE_NE_K + 2*MMQ_TILE_NE_K/QI8_0 + 4)
#define MMQ_MMA_TILE_X_K_FP4 (MMQ_TILE_NE_K + MMQ_TILE_NE_K / QI8_0)
The resulting value is correct; I just don't think you should be calculating it like this, since it will be confusing. It would be better to use something like MMQ_TILE_NE_K + 4, though ideally you would replace the hardcoded value with something that indicates where it comes from.
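For context, a hedged sketch of the arithmetic, assuming the usual llama.cpp constants MMQ_TILE_NE_K = 32 and QI8_0 = 8 (both assumptions here, check the actual headers):

```cuda
// MMQ_TILE_NE_K / QI8_0 == 32 / 8 == 4, so the current expression evaluates to 36.
// The suggested spelling keeps the same value but is easier to read:
#define MMQ_MMA_TILE_X_K_FP4 (MMQ_TILE_NE_K + 4)
```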
const uint8_t q_lo_0 = __shfl_sync(0xFFFFFFFF, q_val, base,      WARP_SIZE);
const uint8_t q_lo_1 = __shfl_sync(0xFFFFFFFF, q_val, base + 1,  WARP_SIZE);
const uint8_t q_hi_0 = __shfl_sync(0xFFFFFFFF, q_val, base + 16, WARP_SIZE);
const uint8_t q_hi_1 = __shfl_sync(0xFFFFFFFF, q_val, base + 17, WARP_SIZE);
This needs a comment to explain the permutation.
I added a comment on top of this function.
I used 512 as MMQ_ITER_K so that all tile sizes remain the same, and it seems to be faster than the previous version.
# 80 == Ampere, asynchronous data loading, faster tensor core instructions
# 86 == RTX 3000, needs CUDA v11.1
# 89 == RTX 4000, needs CUDA v11.8
# 100 == Blackwell, needs CUDA v12.8, native FP4 tensor cores
120, please. Blackwell DC uses a different tensor core design which works very differently; .block_scale mma tensor core ops (non-tcgen05) will not compile on sm_100/103/110.
#define GGML_CUDA_CC_TURING 750
#define GGML_CUDA_CC_AMPERE 800
#define GGML_CUDA_CC_ADA_LOVELACE 890
#define GGML_CUDA_CC_BLACKWELL 1000
Distinguish Blackwell DC and the smaller dies more clearly here (maybe marking them as _120 or _GB20X?).
The mma_block_scaled function isn't going to compile on the Blackwell parts with tcgen05 tensor cores (i.e. [G]B200/B300 and Thor).
120f is most likely what you want, to also cover the DGX Spark (which is sm_121).
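A hedged sketch of the kind of split being asked for; the second macro name and its value are illustrative, not taken from the PR (llama.cpp encodes compute capability X.Y as 100*X + 10*Y):

```cuda
#define GGML_CUDA_CC_BLACKWELL       1000 // sm_100/103/110: DC parts ([G]B200/B300, Thor), tcgen05 tensor cores
#define GGML_CUDA_CC_BLACKWELL_GB20X 1200 // sm_120/121: RTX 50xx and DGX Spark, non-tcgen05 block_scale mma
```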
Currently WIP, trying to add native fp4 support for Blackwell and beyond. To compile, `-DCMAKE_CUDA_ARCHITECTURES="120a"` is required (see the configure sketch at the end of this description).

Blackwell has an `m16n8k64` instruction for 4-bit types (mxfp4, nvfp4 and int4) which advertises 2x throughput compared to the int8 tensor cores. However, at the moment this PR is ~~10% slower than master~~ 25% faster than master on PP. The other issue is that we quantize activations to mxfp4 instead of q8, which leads to failures in `test-backend-ops`; PPL tests are okay with this change, though that does not rule out correctness issues.

TODO:

on RTX Pro 6000 Blackwell
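For reference, a minimal configure/build sketch matching the 120a requirement above; apart from the architecture setting, the flags are just the standard llama.cpp CUDA CMake flow and are assumptions here, not part of the PR:

```sh
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120a"
cmake --build build --config Release -j
```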