I am reading the code in https://docs.nvidia.com/cutlass/media/docs/cpp/cute/0x_gemm_tutorial.html and am quite confused about what it means, so I did some experiments.
There are 256 threads, and the size of smemA is 128x8 = 1024 elements, so each thread should have 4 data elements. But the shape of tCsA_old is (_8,_1). I know local_partition is built on zipped_divide, and after division, it is 4. It seems that there is some data overlap between the threads with threadId 0-127 and the threads with threadId 128-255. Do tid=1 and tid=128 have the same data? Why?
It seems that every thread has "one row". For
@ccecka 🙏
They are projections of the Thr-Layout. There are images and explanations in the document you linked to: The A-data and B-data need projections of the
Please read the predication.md doc in the same docs dir.
yes
No, we are partitioning the inputs for the MMA overlay layout of threads; therefore they both use tC (projections thereof for the MK and NK modes).
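To make the "projection" idea concrete, here is a minimal sketch of the index arithmetic in plain Python. It is not the CuTe API. It assumes a 128x8 tile of A, a 128x8 tile of B, and a 16x16 column-major thread layout for tC; those sizes and all function names are assumptions for illustration, so check the tC used in your own kernel. The point it tries to show: once only the M mode of the thread layout is kept, every thread with the same m coordinate sees the same rows of A, so sharing between threads is expected rather than a bug.

```python
# Plain-Python model of the index arithmetic only; this is NOT the CuTe API.
# Assumed sizes (hypothetical, for illustration): a 128x8 tile of A, a 128x8
# tile of B, and a 16x16 column-major thread layout (256 threads).
BLK_M, BLK_N, BLK_K = 128, 128, 8
THR_M, THR_N = 16, 16


def thread_coord(tid):
    """(m, n) coordinate of thread `tid` in the assumed column-major layout."""
    return tid % THR_M, tid // THR_M


def a_rows(tid):
    """Rows of the A tile seen by `tid` when only the M mode is kept."""
    m, _ = thread_coord(tid)
    return list(range(m, BLK_M, THR_M))


def b_rows(tid):
    """Rows of the B tile seen by `tid` when only the N mode is kept."""
    _, n = thread_coord(tid)
    return list(range(n, BLK_N, THR_N))


if __name__ == "__main__":
    # Each thread sees 8 rows of A and every one of the 8 k-columns.
    print(len(a_rows(0)), "rows x", BLK_K, "columns of A per thread")
    # Threads 0 and 16 differ only in their n coordinate, so they share A rows.
    print(a_rows(0) == a_rows(16))
    # Threads 0 and 1 differ in their m coordinate, so they share B rows instead.
    print(b_rows(0) == b_rows(1))
```

Under these assumptions each thread sees 8 x 8 = 64 elements of A rather than 1024 / 256 = 4, because the 16 threads that differ only in their n coordinate all need the same A rows to compute their part of C.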