[GPU] Add coalescing to reduction tiling #23673

Merged
nirvedhmeshram merged 2 commits into iree-org:main from nirvedhmeshram:coalesce_pr
Mar 11, 2026
Conversation

@nirvedhmeshram
Contributor

@nirvedhmeshram nirvedhmeshram commented Mar 5, 2026

This is useful for further optimizations like prefetching; see details in #23557.

Regarding the implementation: we check for a parent loop because we may perform reduction tiling multiple times. For example, in direct convolution we tile the filter reduction dims, apply pack-to-intrinsics and reshape patterns, and then later tile the channel dimension, so we want to be able to coalesce all of these loops into one loop. It is assumed that all loops (parent or tiling) come from tiling of the same root op; see the discussion in the PR for why we didn't end up adding additional checks to verify this.
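For readers unfamiliar with the transform: loop coalescing collapses a nest of tiling loops into a single loop whose flat induction variable is de-linearized back into the original indices. A conceptual Python sketch of that idea (this is not IREE's MLIR implementation; the function name and structure are illustrative only):

```python
def coalesced(trip_counts, body):
    """Run `body(i0, ..., ik)` over a loop nest as ONE flat loop.

    Instead of k nested loops, iterate a single flat index and
    recover each original induction variable with div/mod
    (de-linearization), innermost trip count peeled off first.
    """
    total = 1
    for tc in trip_counts:
        total *= tc
    for flat in range(total):
        idxs = []
        rem = flat
        for tc in reversed(trip_counts):
            idxs.append(rem % tc)
            rem //= tc
        body(*reversed(idxs))

# Example: a parent loop from an earlier reduction tiling (trip count 2)
# around a later tiling loop (trip count 3) becomes one loop of 6 steps.
visited = []
coalesced([2, 3], lambda i, j: visited.append((i, j)))
# visited == [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Having one loop instead of a nest is what makes transforms like prefetching, which pipeline across a single loop's iterations, applicable to the whole reduction iteration space.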

I checked the effect of this on direct convolution across 336 convolution shapes (excluding any filter 1 shapes, as those go down the GEMM path) and found no performance difference. I believe the true impact of this can only be seen with tuning.

Fixes : #23557

@nirvedhmeshram nirvedhmeshram marked this pull request as draft March 6, 2026 22:44
Signed-off-by: Nirvedh Meshram <[email protected]>
Co-Authored-By: Claude Sonnet 4 <[email protected]>
Signed-off-by: Nirvedh Meshram <[email protected]>
@nirvedhmeshram nirvedhmeshram marked this pull request as ready for review March 9, 2026 20:34
@Max191
Contributor

Max191 commented Mar 10, 2026

I check the effect of this on direct convolution and found no difference in the performance on 336 convolution shapes

There was no difference at all? Doesn't this cause prefetching to happen where it otherwise wouldn't? Were you testing the default heuristic or using the tuned configurations that were generated from when we had non-coalesced loops?

@nirvedhmeshram
Contributor Author

I check the effect of this on direct convolution and found no difference in the performance on 336 convolution shapes

There was no difference at all? Doesn't this cause prefetching to happen where it otherwise wouldn't? Were you testing the default heuristic or using the tuned configurations that were generated from when we had non-coalesced loops?

By default, prefetching is off for direct conv, and I didn't use a tuning spec when doing this comparison; I just wanted to show, in the context of this PR, that this transform is not harmful to performance. With tuning, I am able to see a perf improvement from prefetching when coalescing is done that was not possible without it.

Signed-off-by: Nirvedh Meshram <[email protected]>
@nirvedhmeshram nirvedhmeshram merged commit bd361f7 into iree-org:main Mar 11, 2026
55 of 57 checks passed

Development

Successfully merging this pull request may close these issues.

Need pipelining support for direct convolutions