[GPUHeuristics] Refactor MMA heuristic seeds to be architecture-specific #23717

Merged
yzhang93 merged 5 commits into iree-org:main from yzhang93:arch-specific-heuristic-seeds
Mar 11, 2026

@yzhang93
Contributor

Each GPU architecture now defines a constexpr ArchSeedSet containing gemm, scaled gemm, and convolution seed arrays indexed by GemmSize. The architecture is determined from the target and the corresponding seed set is returned (e.g. RDNA4 uses tuned seeds from RX 9070 XT benchmarking, others use the default). GemmSize enum values are changed to start at 0 for direct array indexing.
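
For readers who want a picture of the layout described above, here is a minimal C++ sketch. It is not the code from this PR: the `Seeds` fields, every numeric value, the three-bucket shape of `GemmSize`, and the helper `pickGemmSeeds` are hypothetical, and matching on the string `gfx1201` is an assumption made for illustration. Only the names `ArchSeedSet`, `GemmSize`, and `getArchSeedSet` come from the description.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Assumed size buckets. Values start at 0 so they can index arrays directly.
enum class GemmSize : std::size_t { Small = 0, Medium = 1, Large = 2 };
constexpr std::size_t kNumGemmSizes = 3;

// Hypothetical seed record; the real struct has its own fields.
struct Seeds {
  int subgroupCount;
  int mnTileCount;
  int kTileCount;
};

using SeedArray = std::array<Seeds, kNumGemmSizes>;

// One set of per-size seed tables for a single architecture.
struct ArchSeedSet {
  SeedArray gemm;
  SeedArray scaledGemm;
  SeedArray conv;
};

// Placeholder numbers only; these are not the tuned values from the PR.
constexpr ArchSeedSet kDefaultSeeds = {
    /*gemm=*/{{{4, 4, 8}, {4, 8, 4}, {4, 16, 2}}},
    /*scaledGemm=*/{{{4, 4, 8}, {4, 8, 4}, {4, 16, 2}}},
    /*conv=*/{{{2, 4, 8}, {4, 4, 4}, {4, 8, 2}}},
};

constexpr ArchSeedSet kRdna4Seeds = {
    /*gemm=*/{{{2, 4, 4}, {4, 4, 4}, {4, 8, 4}}},
    /*scaledGemm=*/{{{2, 4, 4}, {4, 4, 4}, {4, 8, 4}}},
    /*conv=*/{{{2, 2, 4}, {2, 4, 4}, {4, 4, 4}}},
};

// Return the architecture's seed set, falling back to the default.
const ArchSeedSet &getArchSeedSet(const std::string &arch) {
  if (arch == "gfx1201")
    return kRdna4Seeds;
  return kDefaultSeeds;
}

// Hypothetical helper: index the gemm table directly with the enum value.
const Seeds &pickGemmSeeds(const std::string &arch, GemmSize size) {
  const std::size_t idx = static_cast<std::size_t>(size);
  assert(idx < kNumGemmSizes && "GemmSize out of range");
  return getArchSeedSet(arch).gemm[idx];
}
```

In this sketch, keeping the enum zero-based is what lets `gemm[idx]` work without an offset or a switch statement.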

Add RDNA4 (gfx1201) config tests covering small, medium, and large matmul and convolution shapes.

RDNA4 benchmark results (RX 9070 XT, vs default seeds):

  • GEMM shapes: ~40% of shapes improved, 15.9% geometric mean speedup.
  • Prod Conv shapes: ~25% of shapes improved, 6% geometric mean speedup.
  • Proxy Conv shapes: ~30% of shapes improved, 14.5% geometric mean speedup.
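
The description does not say how the geometric means were computed. As a hedged illustration, the sketch below assumes a per-shape speedup ratio of baseline time over tuned time, averaged geometrically across all benchmarked shapes; the function name and parameters are hypothetical. Under that assumption a result of 1.159 would correspond to the reported 15.9%.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Geometric mean of per-shape speedups, where speedup = baseline / tuned.
// Summing logs avoids overflow when multiplying many ratios together.
double geomeanSpeedup(const std::vector<double> &baselineUs,
                      const std::vector<double> &tunedUs) {
  assert(!baselineUs.empty() && baselineUs.size() == tunedUs.size());
  double logSum = 0.0;
  for (std::size_t i = 0; i < baselineUs.size(); ++i)
    logSum += std::log(baselineUs[i] / tunedUs[i]);
  return std::exp(logSum / static_cast<double>(baselineUs.size()));
}
```

A geometric mean treats a 2x speedup on one shape and a 2x slowdown on another as cancelling out, which an arithmetic mean of ratios would not.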

Co-authored-by: Claude <[email protected]>
Signed-off-by: yzhang93 <[email protected]>
@yzhang93 yzhang93 force-pushed the arch-specific-heuristic-seeds branch from 3b5aca1 to 792645b Compare March 10, 2026 02:30
Contributor

@Yu-Zhewen Yu-Zhewen left a comment

LGTM, thanks!

Member

@kuhar kuhar left a comment

Encoding target-specific features in generic compiler code is an anti-pattern... Is there any way to move it to the rocm target?

},
};

/// RDNA4 seeds (tuned based on RX 9070 XT benchmarking data).
Member

I guess the question is whether this should be arch or SKU-specific. With CDNA3, we saw that different SKUs like MI300X vs. MI308 needed different heuristics. We have that supported for default tuning specs.
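
One way to picture the arch-versus-SKU question is a lookup that prefers a SKU-specific entry and falls back to the architecture, then to a default. The sketch below is purely hypothetical and is not how this PR or the default tuning specs are implemented; the function name, the key strings, and the table identifiers are all made up for illustration.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical resolver: returns the name of the seed table to use.
// Precedence: (arch, SKU) override, then arch-wide table, then "default".
std::string resolveSeedTable(const std::string &arch, const std::string &sku) {
  // Illustrative entries only; no such tables are claimed to exist.
  static const std::map<std::pair<std::string, std::string>, std::string>
      skuOverrides = {
          {{"cdna3", "MI308"}, "cdna3-mi308"},
      };
  static const std::map<std::string, std::string> archTables = {
      {"cdna3", "cdna3"},
      {"gfx1201", "rdna4"},
  };

  auto skuIt = skuOverrides.find(std::make_pair(arch, sku));
  if (skuIt != skuOverrides.end())
    return skuIt->second;

  auto archIt = archTables.find(arch);
  if (archIt != archTables.end())
    return archIt->second;

  return "default";
}
```

With this shape, SKU entries can be added later only where measurements show that the arch-wide seeds fall short.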

Collaborator

Maybe, but it also depends on what we have tried. I think this is something we will have to evolve, keeping in mind that we need to manage the different SKUs. But without measuring we won't know what these should be.

Signed-off-by: yzhang93 <[email protected]>

@yzhang93
Contributor Author

yzhang93 commented Mar 11, 2026

> Encoding target-specific features in generic compiler code is an anti-pattern... Is there any way to move it to the rocm target?

Good point. Moved ArchSeedSet, the seed tables, and getArchSeedSet() to KnownTargets.cpp/.h. Could you review it again? @kuhar

Member

@kuhar kuhar left a comment

Thanks for the changes. I think this is a pragmatic compromise -- we can land in the current form and decide how to extend later when we learn what it takes to generalize it to other architectures / SKUs.

Signed-off-by: yzhang93 <[email protected]>
Signed-off-by: yzhang93 <[email protected]>
@yzhang93 yzhang93 merged commit 07e7697 into iree-org:main Mar 11, 2026
83 of 84 checks passed

5 participants