CUDA: experimental native mxfp4 support for blackwell [WIP] #17906
base: master
Conversation
Nice speedup.
Master: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
PR: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
ggml/src/ggml-cuda/common.cuh
Outdated
if (sign > 0.0f) {
    return static_cast<uint8_t>(best_i);       // 0..7
} else {
    return static_cast<uint8_t>(best_i | 0x8); // 8..15
}
I think it would be slightly more optimal to extract the sign bit from x with a bit shift and a logical AND.
More generally, there are FP4 conversion intrinsics in the CUDA math API, but I'm not sure whether they would be of use.
ggml/src/ggml-cuda/mmq.cuh
Outdated
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 0] = compress(aux_q4[1]) << 16 | compress(aux_q4[0]);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 1] = compress(aux_q4[3]) << 16 | compress(aux_q4[2]);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 2] = compress(aux_q4[1] >> 4) << 16 | compress(aux_q4[0] >> 4);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 3] = compress(aux_q4[3] >> 4) << 16 | compress(aux_q4[2] >> 4);
At this point in the code you should be suffering from a 4-way shared memory bank conflict.
    return 0;
}

const uint8_t sign_bit = x < 0.0f ? 0x8 : 0;
I don't know if the compiler is smart enough to do this optimization, but I meant to transplant the sign bit directly, without any conditional statements at all: reinterpret the float's bits as an unsigned integer, shift 28 bits to the right, and apply & 0x8.
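A minimal sketch of the branchless variant described here, assuming `best_i` is the 3-bit magnitude index from the surrounding code; the helper name is hypothetical:

```cuda
// Move the IEEE-754 sign bit (bit 31) down to bit 3 of the fp4 code,
// with no branch and no floating-point comparison.
static __device__ __forceinline__ uint8_t fp4_apply_sign(float x, int best_i) {
    const uint32_t bits     = __float_as_uint(x);    // reinterpret the bits, not a value cast
    const uint32_t sign_bit = (bits >> 28) & 0x8;    // bit 31 -> bit 3
    return static_cast<uint8_t>(best_i | sign_bit);  // 0..7 positive, 8..15 negative
}
```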
ggml/src/ggml-cuda/mmq.cuh
Outdated
}

#define MMQ_MMA_TILE_X_K_Q8_0 (2*MMQ_TILE_NE_K + 2*MMQ_TILE_NE_K/QI8_0 + 4)
#define MMQ_MMA_TILE_X_K_FP4 (MMQ_TILE_NE_K + MMQ_TILE_NE_K / QI8_0)
The resulting value is correct; I just don't think you should be calculating it like this, since it will be confusing. It would be better to use something like MMQ_TILE_NE_K + 4, though ideally you would replace the hardcoded value with something that indicates where it comes from.
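For context, a hedged sketch of the arithmetic, assuming the usual llama.cpp constants MMQ_TILE_NE_K = 32 and QI8_0 = 8 (both assumptions here, check the actual headers):

```cuda
// MMQ_TILE_NE_K / QI8_0 == 32 / 8 == 4, so the current expression evaluates to 36.
// The suggested spelling keeps the same value but is easier to read:
#define MMQ_MMA_TILE_X_K_FP4 (MMQ_TILE_NE_K + 4)
```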
const uint8_t q_lo_0 = __shfl_sync(0xFFFFFFFF, q_val, base,      WARP_SIZE);
const uint8_t q_lo_1 = __shfl_sync(0xFFFFFFFF, q_val, base + 1,  WARP_SIZE);
const uint8_t q_hi_0 = __shfl_sync(0xFFFFFFFF, q_val, base + 16, WARP_SIZE);
const uint8_t q_hi_1 = __shfl_sync(0xFFFFFFFF, q_val, base + 17, WARP_SIZE);
This needs a comment to explain the permutation.
I added a comment on top of this function.
I used 512 as MMQ_ITER_K so that all tile sizes remain the same, and it seems to be faster than the previous version.
# 80 == Ampere, asynchronous data loading, faster tensor core instructions
# 86 == RTX 3000, needs CUDA v11.1
# 89 == RTX 4000, needs CUDA v11.8
# 100 == Blackwell, needs CUDA v12.8, native FP4 tensor cores
120, please. Blackwell DC uses a different tensor core design which works very differently; .block_scale mma tensor core ops (non-tcgen05) will not compile on sm_100/103/110.
#define GGML_CUDA_CC_TURING 750
#define GGML_CUDA_CC_AMPERE 800
#define GGML_CUDA_CC_ADA_LOVELACE 890
#define GGML_CUDA_CC_BLACKWELL 1000
Distinguish Blackwell DC and the smaller dies more clearly here (maybe marking them as _120 or _GB20X?).
The mma_block_scaled function isn't going to compile on the Blackwell parts with tcgen05 tensor cores (i.e. [G]B200/B300 and Thor).
120f is most likely what you want, to also cover the DGX Spark (which is sm_121).
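A hedged sketch of the kind of split being asked for; the second macro name and its value are illustrative, not taken from the PR (llama.cpp encodes compute capability X.Y as 100*X + 10*Y):

```cuda
#define GGML_CUDA_CC_BLACKWELL       1000 // sm_100/103/110: DC parts ([G]B200/B300, Thor), tcgen05 tensor cores
#define GGML_CUDA_CC_BLACKWELL_GB20X 1200 // sm_120/121: RTX 50xx and DGX Spark, non-tcgen05 block_scale mma
```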
Currently WIP, trying to add native fp4 support for Blackwell and beyond. To compile, `-DCMAKE_CUDA_ARCHITECTURES="120a"` is required (see the configure sketch at the end of this description).

Blackwell has an `m16n8k64` instruction for 4-bit types (mxfp4, nvfp4 and int4) which advertises 2x throughput compared to the int8 tensor cores. However, at the moment this PR is ~~10% slower than master~~ 25% faster than master on PP. The other issue is that we quantize activations to mxfp4 instead of q8, which leads to failures in `test-backend-ops`; PPL tests are okay with this change, though that does not rule out correctness issues.

TODO:

on RTX Pro 6000 Blackwell
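For reference, a minimal configure/build sketch matching the 120a requirement above; apart from the architecture setting, the flags are just the standard llama.cpp CUDA CMake flow and are assumptions here, not part of the PR:

```sh
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120a"
cmake --build build --config Release -j
```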