After the divide, the layout is ((_16),(_8,_8)):((_8),(_128,_1)). Does that mean only 16 threads can get the data? What happens if threadIdx.x >= 16?

Please read the predication.md doc in the same docs dir.
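To make the question concrete, here is a minimal sketch of the index arithmetic behind that layout. It is plain C++ rather than CuTe code, and it assumes mode 0 is the thread mode (16 entries, stride 8) and mode 1 is the per-thread value mode (shape (8,8), strides (128,1)). The names `layout_offset`, `thread_in_bounds`, and `guarded_sum` are hypothetical, and the guard is only a stand-in for a predicate, not the mechanism that predication.md describes.

```cpp
#include <cassert>

// Hand-written model of ((_16),(_8,_8)):((_8),(_128,_1)).
// Assumption: mode 0 indexes threads, mode 1 indexes the values of one thread.
constexpr int kThreads = 16;  // extent of the thread mode
constexpr int kV0 = 8;        // first value extent  (stride 128)
constexpr int kV1 = 8;        // second value extent (stride 1)

// A layout maps a coordinate to an offset via the inner product with the strides.
inline int layout_offset(int t, int v0, int v1) {
    assert(v0 >= 0 && v0 < kV0 && v1 >= 0 && v1 < kV1);
    return t * 8 + v0 * 128 + v1 * 1;
}

// Hypothetical guard: only thread indices inside the thread mode are valid.
inline bool thread_in_bounds(int tid) {
    return tid >= 0 && tid < kThreads;
}

// Sum the 64 values owned by thread `tid`; out-of-range threads do no work.
inline int guarded_sum(const int* data, int tid) {
    if (!thread_in_bounds(tid)) {
        return 0;  // without this guard, tid >= 16 would index past the 1024-element tile
    }
    int sum = 0;
    for (int v0 = 0; v0 < kV0; ++v0) {
        for (int v1 = 0; v1 < kV1; ++v1) {
            sum += data[layout_offset(tid, v0, v1)];
        }
    }
    return sum;
}
```

Under these assumptions the 16 threads together cover offsets 0 through 1023 exactly once, so a thread index of 16 or more has no valid coordinate in the thread mode and has to be masked off or given a different partitioning.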

If BLK_M != BLK_N, can it still be done this way?

Yes.

Is it a coincidence that both sA and sB can use tC for projections?

No. We are partitioning the inputs for the MMA overlay layout of threads, so both use tC (projections of it onto the MK and NK modes).
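The projection idea can be sketched in plain C++ without CuTe. This model assumes tC is a 16x16 column-major grid of threads and that each thread takes every 16th row of its tile. The tile sizes and the helper names (`thread_coord`, `a_row`, `b_row`) are hypothetical and chosen only for illustration. BLK_M and BLK_N are deliberately different to show that the same thread layout serves both operands.

```cpp
#include <cassert>

// Hypothetical sizes for illustration; BLK_M != BLK_N on purpose.
constexpr int BLK_M = 128;
constexpr int BLK_N = 64;
constexpr int THR_M = 16;  // tC extent along M (assumed)
constexpr int THR_N = 16;  // tC extent along N (assumed)

struct ThreadCoord {
    int m;
    int n;
};

// Column-major tC: linear thread index -> (m, n) coordinate in the thread grid.
inline ThreadCoord thread_coord(int tid) {
    assert(tid >= 0 && tid < THR_M * THR_N);
    return ThreadCoord{tid % THR_M, tid / THR_M};
}

// Projection of tC onto M: the i-th row of sA (M x K) owned by this thread.
// The thread's N coordinate is ignored.
inline int a_row(int tid, int i) {
    return thread_coord(tid).m + i * THR_M;
}

// Projection of tC onto N: the i-th row of sB (N x K) owned by this thread.
// The thread's M coordinate is ignored.
inline int b_row(int tid, int i) {
    return thread_coord(tid).n + i * THR_N;
}

// Rows per thread differ when the tile extents differ.
constexpr int kRowsPerThreadA = BLK_M / THR_M;
constexpr int kRowsPerThreadB = BLK_N / THR_N;
```

In this model, threads that share an M coordinate read the same rows of sA, and threads that share an N coordinate read the same rows of sB, which is why a single thread layout for C determines the partitioning of both inputs.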

Answer selected by HarryWu99