What is meaning of `Step` in `local_partition` #2509

HarryWu99 · 2025-07-27T16:08:52Z

I am reading sgemm_1.cu from the CuTe GEMM tutorial: https://docs.nvidia.com/cutlass/media/docs/cpp/cute/0x_gemm_tutorial.html

I am confused about what Step means in local_partition when generating tCsA and tCsB:

Tensor tCsA = local_partition(sA, tC, threadIdx.x, Step<_1, X>{});   // (THR_M,BLK_K)
Tensor tCsB = local_partition(sB, tC, threadIdx.x, Step< X,_1>{});   // (THR_N,BLK_K)

I did some experiments.

// sA_layout: (_128,_8):(_8,_1)
Tensor sA = make_tensor(make_smem_ptr(smemA), sA_layout);
Tensor tCsA_old = local_partition(sA, tC, threadIdx.x);
Tensor tCsAstep11 = local_partition(sA, tC, threadIdx.x, Step<_1, _1>{});
Tensor tCsAstep1X = local_partition(sA, tC, threadIdx.x, Step<_1, X>{});
Tensor tCsAstepX1 = local_partition(sA, tC, threadIdx.x, Step<X, _1>{});

about tCsA_old

There are 256 threads and smemA holds 128x8 = 1024 elements, so each thread should own 4 elements. But the shape of tCsA_old is (_8,_1). I know local_partition is built on zipped_divide, and after the division the rest mode is 4:

// shape ((_16,(_8,_2)),_4)
auto divres3 = zipped_divide(sA_layout, tC);

There seems to be data overlap between threads 0-127 and threads 128-255: tid=1 and tid=128 point to the same data. Why?

tid=0 tCsA_old: smem_ptr[32b](0x7f8d00000400) o (_8,_1):(_128,_0)
tid=1 tCsA_old: smem_ptr[32b](0x7f8d00000420) o (_8,_1):(_128,_0)
tid=127 tCsA_old: smem_ptr[32b](0x7f8d000005fc) o (_8,_1):(_128,_0)
tid=128 tCsA_old: smem_ptr[32b](0x7f8d00000420) o (_8,_1):(_128,_0)
tid=255 tCsA_old: smem_ptr[32b](0x7f8d0000061c) o (_8,_1):(_128,_0)
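
One way to read these pointers, as a plain C++ model of the index arithmetic rather than CuTe code. It assumes tC is the 16x16 column-major thread layout from sgemm_1.cu, sA is (_128,_8):(_8,_1), and elements are 4 bytes wide; the function name is made up for illustration. Under those assumptions, the full (m,n) thread coordinate is applied to sA when no Step is given, but sA has only 8 columns, so threads with n >= 8 fall outside the K extent and land on addresses that other threads already use.

```cpp
#include <cstddef>

// Plain C++ model of the offset arithmetic, NOT CuTe code.
// Assumptions: tC is 16x16 column-major, sA strides are (8,1), 4-byte elements.
constexpr int kThrM = 16;              // threads along the first mode of tC
constexpr int kStrideM = 8;            // sA stride along M, in elements
constexpr int kStrideK = 1;            // sA stride along K, in elements
constexpr std::size_t kElemBytes = 4;  // assumed element size

// Byte offset of a thread's first element when both thread coordinates
// (m = tid % 16, n = tid / 16) are applied to sA without a projection.
constexpr std::size_t unprojected_offset_bytes(int tid) {
  return static_cast<std::size_t>((tid % kThrM) * kStrideM +
                                  (tid / kThrM) * kStrideK) * kElemBytes;
}

// Offsets relative to the printed base pointer 0x...400.
static_assert(unprojected_offset_bytes(0)   == 0x000, "tid 0");
static_assert(unprojected_offset_bytes(1)   == 0x020, "tid 1");
static_assert(unprojected_offset_bytes(127) == 0x1fc, "tid 127");
static_assert(unprojected_offset_bytes(128) == 0x020, "tid 128 aliases tid 1");
static_assert(unprojected_offset_bytes(255) == 0x21c, "tid 255");
```

In this model tid=128 maps to (m,n) = (0,8), and column 8 of row 0 has the same address as column 0 of row 1, which is where tid=1 starts.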

about the use of Step<>

It seems that every thread gets one row. For tCsAstep1X, increasing tid by 1 moves to the next row, and the pattern repeats every 16 threads. For tCsAstepX1, tid needs to increase by 16 to move to the next row. What does Step mean in this function? What happens if I use Step<_1, _1>, Step<X, X>, or something else?

tid=0 tCsAstep1X: smem_ptr[32b](0x7fc600000400) o (_8,_8):(_128,_1)
tCsAstepX1: smem_ptr[32b](0x7fc600000400) o (_8,_8):(_128,_1)
tid=1 tCsAstep1X: smem_ptr[32b](0x7fc600000420) o (_8,_8):(_128,_1)
tCsAstepX1: smem_ptr[32b](0x7fc600000400) o (_8,_8):(_128,_1)
tid=16 tCsAstep1X: smem_ptr[32b](0x7fc600000400) o (_8,_8):(_128,_1)
tCsAstepX1: smem_ptr[32b](0x7fc600000420) o (_8,_8):(_128,_1)
tid=127 tCsAstep1X: smem_ptr[32b](0x7fc6000005e0) o (_8,_8):(_128,_1)
tCsAstepX1: smem_ptr[32b](0x7fc6000004e0) o (_8,_8):(_128,_1)
tid=128 tCsAstep1X: smem_ptr[32b](0x7fc600000400) o (_8,_8):(_128,_1)
tCsAstepX1: smem_ptr[32b](0x7fc600000500) o (_8,_8):(_128,_1)
tid=255 tCsAstep1X: smem_ptr[32b](0x7fc6000005e0) o (_8,_8):(_128,_1)
tCsAstepX1: smem_ptr[32b](0x7fc6000005e0) o (_8,_8):(_128,_1)
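
The printed pointers are consistent with the thread index first being mapped to an (m,n) coordinate in tC, and Step then choosing which of the two coordinates is kept. A minimal plain C++ model of that reading (not CuTe code), assuming a 16x16 column-major tC, sA strides of (_8,_1), and 4-byte elements; the function names are hypothetical:

```cpp
#include <cstddef>

// Plain C++ model, NOT CuTe code.
// Assumptions: tC is 16x16 column-major, each sA row is 8 elements of 4 bytes.
constexpr int kThr = 16;               // threads per mode of tC
constexpr std::size_t kRowBytes = 8 * 4;

// Step<_1, X>: keep m = tid % 16, ignore n.
constexpr std::size_t step_1X_offset_bytes(int tid) {
  return static_cast<std::size_t>(tid % kThr) * kRowBytes;
}

// Step<X, _1>: keep n = tid / 16, ignore m.
constexpr std::size_t step_X1_offset_bytes(int tid) {
  return static_cast<std::size_t>(tid / kThr) * kRowBytes;
}

// Offsets relative to the printed base pointer 0x...400.
static_assert(step_1X_offset_bytes(1)   == 0x020, "tid 1, Step<_1,X>");
static_assert(step_X1_offset_bytes(1)   == 0x000, "tid 1, Step<X,_1>");
static_assert(step_1X_offset_bytes(16)  == 0x000, "tid 16, Step<_1,X>");
static_assert(step_X1_offset_bytes(16)  == 0x020, "tid 16, Step<X,_1>");
static_assert(step_1X_offset_bytes(127) == 0x1e0, "tid 127, Step<_1,X>");
static_assert(step_X1_offset_bytes(127) == 0x0e0, "tid 127, Step<X,_1>");
static_assert(step_X1_offset_bytes(128) == 0x100, "tid 128, Step<X,_1>");
```

In the tutorial, Step< X,_1> is applied to sB rather than sA, so there the n coordinate selects rows of B.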

HarryWu99 · 2025-07-28T13:21:13Z

@ccecka 🙏

6 replies

HarryWu99 Jul 29, 2025
Author

Thanks! So X means that mode is ignored, right?
Can numbers other than 1, such as 2, be used in Step<>? What would a 2 mean? 😂

HarryWu99 Jul 30, 2025
Author

@thakkarV 🙏

thakkarV Jul 30, 2025
Collaborator

Yes, X is just an alias for cute::Underscore, which means that mode is ignored for the operation. I do not think we have ever used a step > 1, but I guess it could have the same semantic meaning as a tiler with strides > 1 (skipping coordinates in the target tensor with that step).
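
A small plain C++ model of that "ignored mode" behaviour, for illustration only. It is not the CuTe implementation: StepModel and dice_model are hypothetical names, with true standing in for _1 and false standing in for X.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a two-mode Step: true = _1 (keep), false = X (ignore).
using StepModel = std::array<bool, 2>;

// Keep only the coordinate modes whose step entry is not X.
inline std::vector<int> dice_model(const StepModel& step,
                                   const std::array<int, 2>& coord) {
  std::vector<int> kept;
  for (std::size_t i = 0; i < coord.size(); ++i) {
    if (step[i]) {
      kept.push_back(coord[i]);
    }
  }
  return kept;
}
```

With a thread coordinate (m,n) = (3,5), dice_model({true, false}, {3, 5}) keeps only m and dice_model({false, true}, {3, 5}) keeps only n. In this model, marking both modes with true keeps the full coordinate.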

ccecka Jul 30, 2025

That used to be the meaning, but it went unused and was eventually dropped as unsupported in favor of the more general TV-Layout partitioners and Tiler partitioners in other, more common interfaces like local_tile and TiledCopy. In general, local_partition is never used anymore.

HarryWu99 Jul 31, 2025
Author

Thank you for your reply! It's very useful to me.

ccecka · 2025-07-28T19:29:39Z

They are projections of the Thr-Layout. There are images and explanations in the document you linked to:

This diagram shows a tC layout, highlights two threads in green and blue, shows the projections of the tC layout, and finally highlights the subtensors within sA, sB, and gC that tCsA, tCsB, and tCgC represent.

With the data partitioned across the threads, every thread can now participate in the compute step by writing

gemm(tCsA, tCsB, tCrC);

because every thread owns different subtensors of the data to be computed.

The A-data and B-data need projections of the tC Thr-Layout so that they have all of the appropriate data to compute their elements within tCrC.
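
A sketch of why the projections line up, in plain C++ rather than CuTe. It assumes a 16x16 column-major tC and that C(r,c) is computed from row r of the (M,K) tile of A and row c of the (N,K) tile of B, as in the tutorial; all names are hypothetical. Thread (m,n) owns C(m+16i, n+16j), so it needs the A rows m+16i and the B rows n+16j:

```cpp
#include <vector>

constexpr int kThr = 16;  // assumed threads per mode of tC

// Rows of a (BLK, BLK_K) tile assigned to thread coordinate t in [0, kThr)
// when the tile's first mode is split across kThr threads.
inline std::vector<int> rows_for(int t, int blk) {
  std::vector<int> rows;
  for (int r = t; r < blk; r += kThr) {
    rows.push_back(r);
  }
  return rows;
}

struct ThreadWork {
  std::vector<int> a_rows;  // rows of A this thread reads (M projection)
  std::vector<int> b_rows;  // rows of B this thread reads (N projection)
};

// Map a linear thread index to its (m, n) coordinate, then project:
// A only sees m, B only sees n.
inline ThreadWork work_for(int tid, int blk_m, int blk_n) {
  const int m = tid % kThr;
  const int n = tid / kThr;
  return {rows_for(m, blk_m), rows_for(n, blk_n)};
}
```

Because the M and N modes are partitioned independently, the same model works when BLK_M != BLK_N. For example, work_for(17, 128, 64) gives 8 rows of A and 4 rows of B.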

4 replies

HarryWu99 Jul 29, 2025
Author

Thanks! I understand what tCsA means, but why does zipped_divide produce such a layout when used this way?

Tensor tCsA = local_partition(sA, tC, threadIdx.x, Step<_1, X>{});
// equal to
local_partition(sA, make_layout(make_shape(Int<16>{})), threadIdx.x);

After the divide, the layout is ((_16),(_8,_8)):((_8),(_128,_1)), so can only 16 threads get data? What happens if threadIdx.x >= 16?

Is it a coincidence that both sA and sB can use tC for projections? Can it also be done this way if BLK_M != BLK_N?

HarryWu99 Jul 29, 2025
Author

https://docs.nvidia.com/cutlass/media/docs/cpp/cute/03_tensor.html#inner-and-outer-partitioning

thakkarV Jul 29, 2025
Collaborator

After the divide, the layout is ((_16),(_8,_8)):((_8),(_128,_1)), so can only 16 threads get data? What happens if threadIdx.x >= 16?

Please read the predication.md doc in the same docs dir.

Can it also be done this way if BLK_M != BLK_N?

Yes.

Is it a coincidence that both sA and sB can use tC for projections?

No, we are partitioning the inputs for the MMA overlay layout of threads, so they both use tC (projections thereof for the MK and NK modes).

Answer selected by HarryWu99

HarryWu99 Jul 30, 2025
Author

Thanks!
