[GPU] Add coalescing to reduction tiling#23673
Conversation
Signed-off-by: Nirvedh Meshram <[email protected]>
Co-Authored-By: Claude Sonnet 4 <[email protected]>
There was no difference at all? Doesn't this cause prefetching to happen where it otherwise wouldn't? Were you testing the default heuristic, or using the tuned configurations that were generated from when we had non-coalesced loops?
By default, prefetching is off for direct conv, and I didn't use a tuning spec when doing this comparison; I just wanted to show, in the context of this PR, that this transform is not harmful to performance. With tuning, I am able to see a performance improvement from prefetching when doing coalescing that was not possible without it.
This is useful for enabling further optimizations like prefetching; see details in #23557.
Regarding the implementation, we check for a parent loop because we may do reduction tiling multiple times. For example, in direct convolution we tile the filter reduction dims, run pack-to-intrinsics and reshape patterns, and later tile the channel dimension, so we want to be able to coalesce all of these loops into a single loop. It is assumed that all loops (parent or tiling) come from tiling of the same root op; see the discussion in the PR for why we didn't end up adding additional checks to verify this.
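The coalescing transform itself can be illustrated with a small Python sketch (function names and shapes are hypothetical, not IREE code): a nest of reduction loops produced by successive tiling passes is replaced by a single loop over the combined trip count, with the original induction variables recovered by delinearizing the new one.

```python
def reduce_nested(data, n_outer, n_inner):
    """Reference form: two nested reduction loops, e.g. one from each
    of two separate reduction-tiling passes over the same root op."""
    acc = 0
    for i in range(n_outer):
        for j in range(n_inner):
            acc += data[i][j]
    return acc


def reduce_coalesced(data, n_outer, n_inner):
    """Coalesced form: a single loop whose trip count is the product of
    the original trip counts. The original indices (i, j) are recovered
    by delinearizing the single induction variable."""
    acc = 0
    for iv in range(n_outer * n_inner):
        i, j = divmod(iv, n_inner)  # delinearize into the original indices
        acc += data[i][j]
    return acc
```

Having one loop rather than a nest is what makes it straightforward for a later pass to prefetch across the whole reduction iteration space.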
I checked the effect of this on direct convolution across 336 convolution shapes (excluding any filter-1 shapes, as those go down the GEMM path) and found no performance difference. I believe the true impact of this can only be seen with tuning.
Fixes #23557