[Codegen][GPU] Enable swizzling for scaled matmuls #23175
Muzammiluddin-Syed-ECE merged 22 commits into iree-org:main
kuhar left a comment:
What's the speedup after we land this?
Roughly a geomean speedup of 4.4% across shapes that previously had bank conflicts. Shapes that had no bank conflicts to begin with were unchanged.
This is part of a series of PRs implementing support for XOR swizzles in IREE. We require the LDS bank count to figure out XOR swizzle parameters. See PR: #23175

Signed-off-by: Muzammiluddin Syed <[email protected]>
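For context on why the LDS bank count matters when picking swizzle parameters, here is a minimal sketch of a generic XOR swizzle and a toy bank-conflict counter. Everything in it is an assumption made for illustration: the `LdsModel` struct, the simple interleaved bank mapping, and the `xorSwizzle` and `maxLanesPerBank` helpers are hypothetical and are not IREE's implementation. The parameter names `rowElems` and `accessElems` are borrowed from the follow-up commit message in this thread, but their exact semantics in IREE may differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical LDS model for illustration only (not an IREE type).
struct LdsModel {
  int64_t bankCount;      // Number of LDS banks (assumed value, e.g. 64).
  int64_t bankWidthBytes; // Bytes served by one bank per access (assumed 4).
};

// Bank that a byte address maps to under a simple interleaved model.
int64_t bankOf(const LdsModel &lds, int64_t byteAddr) {
  return (byteAddr / lds.bankWidthBytes) % lds.bankCount;
}

// Generic XOR swizzle of an element offset. Each row of `rowElems` elements
// is split into groups of `accessElems`, and the group index is XORed with
// the row index. Assumes rowElems / accessElems is a power of two so the
// XOR result stays inside the row.
int64_t xorSwizzle(int64_t offset, int64_t rowElems, int64_t accessElems) {
  int64_t groupsPerRow = rowElems / accessElems;
  int64_t row = offset / rowElems;
  int64_t col = offset % rowElems;
  int64_t group = col / accessElems;
  int64_t within = col % accessElems;
  int64_t swizzledGroup = group ^ (row % groupsPerRow);
  return row * rowElems + swizzledGroup * accessElems + within;
}

// Worst-case number of lanes that land on the same bank when lane i reads
// the first element of row i. Only the first bank touched by each access is
// counted, which is a simplification.
int64_t maxLanesPerBank(const LdsModel &lds, int64_t numLanes,
                        int64_t rowElems, int64_t accessElems,
                        int64_t elemBytes, bool swizzle) {
  std::vector<int64_t> hits(lds.bankCount, 0);
  for (int64_t lane = 0; lane < numLanes; ++lane) {
    int64_t offset = lane * rowElems;
    if (swizzle)
      offset = xorSwizzle(offset, rowElems, accessElems);
    ++hits[bankOf(lds, offset * elemBytes)];
  }
  return *std::max_element(hits.begin(), hits.end());
}
```

With 64 banks of 4 bytes and rows of 256 one-byte elements, every row starts at bank 0, so a column-wise read by 16 lanes serializes on one bank in this model. The swizzle spreads the same reads over 16 different banks. The row size that triggers the conflict depends directly on the bank count, which is why the bank count has to be known before parameters can be chosen.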
Here's a gist showing the perf against the most recent top of main: data.
I don't know how to read this data -- there are no units and it's unclear if positive change means improvement or regression. Do you have data about bank conflicts? (Doesn't have to be exhaustive, just sample a dozen random shapes and see if the bank conflict counters align with the observed delta).
Sorry, I should've noted in the file that the units are throughput in TFLOPs (higher is better). Also, here is some data about bank conflicts on top of main.
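To make the sign convention concrete, here is a small sketch of how an improvement ratio, an improvement percentage, and a geomean can be derived from two throughput measurements. The `Sample` struct and the function names are hypothetical and chosen for illustration; the actual script behind the gist is not shown in this thread. Because throughput is higher-is-better, a ratio above 1 (a positive percentage) is an improvement and a ratio below 1 is a regression.

```cpp
#include <cmath>
#include <vector>

// One benchmark shape measured on the feature branch and on top of main.
// Both values are throughput in TFLOPs (higher is better).
struct Sample {
  double featureTflops;
  double mainTflops;
};

// Ratio > 1 means the feature branch is faster than top of main.
double improvementRatio(const Sample &s) {
  return s.featureTflops / s.mainTflops;
}

// Same information expressed as a signed percentage.
double improvementPercent(const Sample &s) {
  return (improvementRatio(s) - 1.0) * 100.0;
}

// Geometric mean of the per-shape ratios, computed in log space.
// Assumes `samples` is non-empty and all throughputs are positive.
double geomeanRatio(const std::vector<Sample> &samples) {
  double logSum = 0.0;
  for (const Sample &s : samples)
    logSum += std::log(improvementRatio(s));
  return std::exp(logSum / static_cast<double>(samples.size()));
}
```

For example, the `1000_1024_8192_512` row of the results table in the PR description (96.044057 vs. 86.408117 TFLOPs) gives a ratio of about 1.1115, an improvement of about 11.15%. The `500_16384_26624_1664` row (713.748860 vs. 723.841063 TFLOPs) gives a ratio below 1, a small regression.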
krzysz00 left a comment:
Nitpicks, but I think this makes sense overall as a step for making this feature happen
kuhar left a comment:
Looks good overall. One more thing to consider is replacing all these `pair<int64_t, int64_t>` with a custom struct so that each integer is named.
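As an illustration of this suggestion, here is a minimal before/after sketch. The struct name `XorSwizzleParams`, the function names, and the values 256 and 16 are hypothetical. The field names `rowElems` and `accessElems` are borrowed from the follow-up commit message in this thread, and it is an assumption that these are the two integers the pairs in the PR carry.

```cpp
#include <cstdint>
#include <utility>

// Before: callers must remember which integer is first and which is second.
std::pair<int64_t, int64_t> getSwizzleParamsAsPair() { return {256, 16}; }

// After: each integer is named, so a swapped field is visible at the use site.
struct XorSwizzleParams {
  int64_t rowElems;    // Elements per swizzled row.
  int64_t accessElems; // Elements per contiguous access.
};

XorSwizzleParams getSwizzleParams() { return {256, 16}; }

// Reads naturally with named fields; with a pair this would be
// `p.first / p.second`, which is easy to get backwards.
int64_t groupsPerRow(const XorSwizzleParams &p) {
  return p.rowElems / p.accessElems;
}
```

The struct costs nothing at runtime compared to the pair, and it leaves room to document or validate each field later.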
the clang-tidy warnings are false positives (
…n CAPI (#23442)

After #23175, we now generate swizzle hint ops for scaled gemms whose parameters are set during `lowering_config` selection. These parameters (`rowElems` and `accessElems`) can be chosen via the tuner too (although in the future we intend for there to be an analytically derived solution to this). This PR exposes two functions, `getXorShuffleBounds` and `isXorShuffleValid`, to allow the tuner to constrain its search space for applicable XOR swizzles.

Assisted by: composer-1

Signed-off-by: Muzammiluddin Syed <[email protected]>
…rg#23273)

This is part of a series of PRs implementing support for XOR swizzles in IREE. We require the LDS bank count to figure out XOR swizzle parameters. See PR: iree-org#23175

Signed-off-by: Muzammiluddin Syed <[email protected]>
This is the fourth of a series of PRs that together implement support in IREE for XOR swizzling through the SwizzleHintOp.

There are four PRs that need to be merged:
1) Allow rank > 1 swizzle hint op operands and add a pass to flatten swizzle hint allocs.
2) Add patterns which can fold reshapes and `extract_slice` ops into empty ops through swizzle hint ops.
3) Add swizzle hint attribute to be set in `lowering_config` and consumed in `GPUPromoteMatmulOperandsPass`.
4) Update `LLVMGPUSelectLoweringStrategy` Pass to set xor swizzles for MXFP4 GEMMs.

This is PR 4, which does three things:
- Expresses the row width as a function of CacheLineSizeInBits and the element type of the chosen intrinsic's operands.
- Adds swizzle attribute to promotion type.
- Adds a test for the swizzle attribute which should have been added in PR 3.

We see an average 4.8% geomean improvement over top of main in the mxfp4 gemms tested. See [full data here](https://gist.github.com/Muzammiluddin-Syed-ECE/71c517206d89018f8706d661c94294b6/).

|shape|Feature_Throughput (TFLOPs)|Top_of_main_Throughput (TFLOPs)|improvement_ratio|improvement_percent|
|-----|---------------------------|-------------------------------|-----------------|-------------------|
|1000_1024_8192_512|96.044057|86.408117|1.111517|11.151661|
|16300_1024_8192_512|1282.544470|1219.842766|1.051401|5.140146|
|500_16384_26624_1664|713.748860|723.841063|0.986057|-1.394257|
|53200_53248_8192_512|1194.983741|1107.453688|1.079037|7.903721|
|8100_16384_8192_512|1388.739043|1301.227179|1.067253|6.725333|
|8192_512_256_16|305.040291|301.612872|1.011364|1.136364|
|8192_53248_8192_512|1734.982869|1728.772840|1.003592|0.359216|
|...|...|...|...|...|
|OVERALL_GEOMEAN|||1.049751|4.975053|

Signed-off-by: Muzammiluddin Syed <[email protected]>
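As a rough illustration of the first bullet in the description above, the sketch below derives a row width in elements from a cache line size in bits and an element bit width. This is an assumption about the simplest form such a function could take (one swizzled row spans exactly one cache line). The helper name `rowElemsForCacheLine` is hypothetical, and the PR's actual formula is not shown in this thread.

```cpp
#include <cstdint>

// Hypothetical helper: number of elements that fit in one cache line.
// Returns 0 when the inputs do not divide evenly or are not positive,
// so callers can treat 0 as "no valid row width".
int64_t rowElemsForCacheLine(int64_t cacheLineSizeInBits,
                             int64_t elemBitWidth) {
  if (cacheLineSizeInBits <= 0 || elemBitWidth <= 0)
    return 0;
  if (cacheLineSizeInBits % elemBitWidth != 0)
    return 0;
  return cacheLineSizeInBits / elemBitWidth;
}
```

Under this assumption, a 1024-bit cache line holds 256 four-bit elements or 128 eight-bit elements, so the row width changes with the operand element type of the chosen intrinsic rather than being a fixed constant.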