[Codegen][GPU] Enable swizzling for scaled matmuls#23175

Merged
Muzammiluddin-Syed-ECE merged 22 commits into iree-org:main from Muzammiluddin-Syed-ECE:muzasyed/config
Feb 9, 2026

Muzammiluddin-Syed-ECE (Contributor) commented Jan 16, 2026

This is the fourth of a series of PRs that together implement support in
IREE for XOR swizzling through the SwizzleHintOp.

There are four PRs that need to be merged:

  1. Allow rank > 1 swizzle hint op operands and add a pass to flatten
    swizzle hint allocs.
  2. Add patterns which can fold reshapes and extract_slice ops into
    empty ops through swizzle hint ops.
  3. Add swizzle hint attribute to be set in lowering_config and
    consumed in GPUPromoteMatmulOperandsPass.
  4. Update LLVMGPUSelectLoweringStrategy Pass to set xor swizzles for
    MXFP4 GEMMs.

This is PR 4, which does three things:

  • Expresses the row width as a function of CacheLineSizeInBits and the element type of the chosen intrinsic's operands.
  • Adds swizzle attribute to promotion type.
  • Adds a test for the swizzle attribute which should have been added in PR 3.
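The first bullet can be illustrated with a minimal sketch. It assumes the row width is simply the number of elements that fit in one cache line; the helper name `rowElemsForCacheLine` is hypothetical and is not taken from the IREE sources.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not IREE's API): returns how many elements of the
// given bit width fit in one cache line, or nullopt if the inputs are
// invalid or the element width does not divide the cache line evenly.
std::optional<int64_t> rowElemsForCacheLine(int64_t cacheLineSizeInBits,
                                            int64_t elemBitWidth) {
  if (cacheLineSizeInBits <= 0 || elemBitWidth <= 0)
    return std::nullopt;
  if (cacheLineSizeInBits % elemBitWidth != 0)
    return std::nullopt;
  return cacheLineSizeInBits / elemBitWidth;
}
```

Under an assumed 1024-bit cache line (an illustrative value, not one stated in this PR), 4-bit operands would give a row of 256 elements and 8-bit operands a row of 128 elements.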

We see a 4.8% geomean throughput improvement over top of main in the MXFP4 GEMMs tested. See full data here: https://gist.github.com/Muzammiluddin-Syed-ECE/71c517206d89018f8706d661c94294b6/

| shape | Feature_Throughput (TFLOps) | Top_of_main_Throughput (TFLOps) | improvement_ratio | improvement_percent |
|-------|-----------------------------|---------------------------------|-------------------|---------------------|
| 1000_1024_8192_512 | 96.044057 | 86.408117 | 1.111517 | 11.151661 |
| 16300_1024_8192_512 | 1282.544470 | 1219.842766 | 1.051401 | 5.140146 |
| 500_16384_26624_1664 | 713.748860 | 723.841063 | 0.986057 | -1.394257 |
| 53200_53248_8192_512 | 1194.983741 | 1107.453688 | 1.079037 | 7.903721 |
| 8100_16384_8192_512 | 1388.739043 | 1301.227179 | 1.067253 | 6.725333 |
| 8192_512_256_16 | 305.040291 | 301.612872 | 1.011364 | 1.136364 |
| 8192_53248_8192_512 | 1734.982869 | 1728.772840 | 1.003592 | 0.359216 |
| ... | ... | ... | ... | ... |
| OVERALL_GEOMEAN | | | 1.049751 | 4.975053 |
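The derived columns appear to follow the usual definitions: the ratio is feature throughput divided by top-of-main throughput, and the percent is the ratio minus one, times 100. The sketch below shows that arithmetic together with a geometric mean; the function names are illustrative and the benchmarking scripts behind the table are not shown in this PR.

```cpp
#include <cmath>
#include <vector>

// Ratio of feature-branch throughput to top-of-main throughput.
// Values above 1.0 mean the feature branch is faster.
double improvementRatio(double featureTflops, double mainTflops) {
  return featureTflops / mainTflops;
}

// Converts a ratio such as 1.05 into a percent change such as 5.0.
double improvementPercent(double ratio) { return (ratio - 1.0) * 100.0; }

// Geometric mean of a set of ratios, computed in log space.
// Returns NaN for an empty input.
double geomean(const std::vector<double> &ratios) {
  if (ratios.empty())
    return std::nan("");
  double logSum = 0.0;
  for (double r : ratios)
    logSum += std::log(r);
  return std::exp(logSum / static_cast<double>(ratios.size()));
}
```

Note that the `OVERALL_GEOMEAN` row covers the full dataset, so applying `geomean` to only the rows shown here would not be expected to reproduce it.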

kuhar (Member) left a comment

What's the speedup after we land this?

Muzammiluddin-Syed-ECE (Contributor, Author) commented Jan 16, 2026

> What's the speedup after we land this?

Roughly a 4.4% geomean speedup across shapes that previously had bank conflicts. Shapes that had no bank conflicts to begin with were unchanged.

Muzammiluddin-Syed-ECE added a commit that referenced this pull request Jan 28, 2026
This is part of a series of PRs implementing support for XOR swizzles
in IREE. We require the LDS bank count to figure out XOR swizzle
parameters.

See PR: #23175

---------

Signed-off-by: Muzammiluddin Syed <[email protected]>
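
To show why the bank count matters, here is a minimal sketch of a simplified bank model. It assumes a byte offset maps to bank `(offset / bankWidthBytes) % numBanks`; the function names are hypothetical, and real hardware has further rules (for example, broadcasts of identical addresses) that this model ignores.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified model (an assumption, not a hardware specification):
// consecutive bank-width words are assigned to consecutive banks.
int64_t bankOf(int64_t byteOffset, int64_t numBanks, int64_t bankWidthBytes) {
  return (byteOffset / bankWidthBytes) % numBanks;
}

// Returns the largest number of accesses that land in any single bank.
// A result of 1 means the access pattern is conflict-free in this model.
int64_t maxAccessesPerBank(const std::vector<int64_t> &byteOffsets,
                           int64_t numBanks, int64_t bankWidthBytes) {
  if (numBanks <= 0 || bankWidthBytes <= 0)
    return 0;
  std::vector<int64_t> hits(numBanks, 0);
  for (int64_t offset : byteOffsets)
    ++hits[bankOf(offset, numBanks, bankWidthBytes)];
  return *std::max_element(hits.begin(), hits.end());
}
```

With an illustrative configuration of 64 banks of 4 bytes each (values chosen for the example, not taken from this PR), offsets that are 256 bytes apart all fall in the same bank, which is the kind of pattern a swizzle is meant to break up.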
Muzammiluddin-Syed-ECE (Contributor, Author) commented

> What's the speedup after we land this?

> Roughly a 4.4% geomean speedup across shapes that previously had bank conflicts. Shapes that had no bank conflicts to begin with were unchanged.

Here's a gist showing the perf against most recent top of main: data.

kuhar (Member) commented Jan 29, 2026

> What's the speedup after we land this?

> Roughly a 4.4% geomean speedup across shapes that previously had bank conflicts. Shapes that had no bank conflicts to begin with were unchanged.

> Here's a gist showing the perf against most recent top of main: data.

I don't know how to read this data -- there are no units and it's unclear if positive change means improvement or regression.

Do you have data about bank conflicts? (Doesn't have to be exhaustive, just sample a dozen random shapes and see if the bank conflict counters align with the observed delta).

Muzammiluddin-Syed-ECE (Contributor, Author) commented

> I don't know how to read this data -- there are no units and it's unclear if positive change means improvement or regression.

> Do you have data about bank conflicts? (Doesn't have to be exhaustive, just sample a dozen random shapes and see if the bank conflict counters align with the observed delta).

Sorry, I should have noted in the file that the values are throughput in TFLOPs (higher is better).

Also, here is some data about bank conflicts on top of main:
https://gist.github.com/Muzammiluddin-Syed-ECE/88e8d52c4a9ecdf9f28c824c3388c937

krzysz00 (Contributor) left a comment

Nitpicks, but I think this makes sense overall as a step toward making this feature happen.

kuhar (Member) left a comment

Looks good overall. One more thing to consider is replacing all these `pair<int64_t, int64_t>` with a custom struct so that each integer is named.
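
A minimal sketch of that suggestion follows. It assumes the pairs carry the two swizzle parameters; the struct name `XorShuffleParams` and the helper are hypothetical, and the field names are borrowed from the `rowElems` and `accessElems` parameters mentioned in the follow-up commit.

```cpp
#include <cstdint>

// Hypothetical replacement for std::pair<int64_t, int64_t>: each integer
// gets a name, so call sites read params.rowElems instead of params.first.
struct XorShuffleParams {
  int64_t rowElems;    // Number of elements in one swizzle row.
  int64_t accessElems; // Number of elements covered by one access.
};

// Illustrative sanity check (an assumption, not IREE's validity rule):
// both values are positive and a row holds a whole number of accesses.
bool hasWholeAccessesPerRow(const XorShuffleParams &params) {
  return params.rowElems > 0 && params.accessElems > 0 &&
         params.rowElems % params.accessElems == 0;
}
```

Named fields also make it harder to swap the two values by accident, which a pair of identically typed integers does not guard against.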

hanhanW pushed a commit that referenced this pull request Feb 6, 2026
Muzammiluddin-Syed-ECE force-pushed the muzasyed/config branch 2 times, most recently from 6299c66 to 9929999 on February 8, 2026 20:29
Muzammiluddin-Syed-ECE (Contributor, Author) commented

The clang-tidy warnings are either false positives (`getXorShuffleAttr` is used in ConfigUtils.cpp) or unrelated to this PR (`getNSize` and `getKSize`).

Muzammiluddin-Syed-ECE merged commit a9fd31b into iree-org:main Feb 9, 2026
57 checks passed
Muzammiluddin-Syed-ECE added a commit that referenced this pull request Feb 10, 2026
…n CAPI (#23442)

After #23175, we now generate
swizzle hint ops for scaled GEMMs whose parameters are set during
`lowering_config` selection. These parameters (`rowElems` and
`accessElems`) can also be chosen by the tuner (although we intend to
derive them analytically in the future).
This PR exposes two functions, `getXorShuffleBounds` and
`isXorShuffleValid`, to allow the tuner to constrain its search space
for applicable XOR swizzles.

Assisted by: composer-1

---------

Signed-off-by: Muzammiluddin Syed <[email protected]>
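
For readers unfamiliar with the two parameters, the sketch below shows one common form of XOR swizzle over a linear element offset. It is a generic illustration, not IREE's implementation: the function name is hypothetical, and it assumes `rowElems` is a multiple of `accessElems` and that the number of accesses per row is a power of two.

```cpp
#include <cstdint>

// Generic XOR swizzle sketch (NOT IREE's implementation).
// Assumes rowElems % accessElems == 0 and that rowElems / accessElems is a
// power of two, so the XOR result stays within the row.
int64_t xorSwizzleOffset(int64_t offset, int64_t rowElems,
                         int64_t accessElems) {
  int64_t row = offset / rowElems;        // Which row the element is in.
  int64_t inRow = offset % rowElems;      // Position inside that row.
  int64_t chunk = inRow / accessElems;    // Which access-sized chunk.
  int64_t lane = inRow % accessElems;     // Position inside the chunk.
  int64_t chunksPerRow = rowElems / accessElems;
  // Permute chunks within the row using the row index as the XOR key.
  int64_t swizzledChunk = chunk ^ (row % chunksPerRow);
  return row * rowElems + swizzledChunk * accessElems + lane;
}
```

Because the XOR key depends only on the row, each row is permuted within itself and elements inside a chunk keep their order, so the mapping stays one-to-one. A tuner searching over `rowElems` and `accessElems` would need constraints of roughly this kind, which is presumably what the two exposed functions provide.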
MaheshRavishankar pushed a commit to MaheshRavishankar/iree that referenced this pull request Feb 24, 2026
…rg#23273)
MaheshRavishankar pushed a commit to MaheshRavishankar/iree that referenced this pull request Feb 24, 2026
MaheshRavishankar pushed a commit to MaheshRavishankar/iree that referenced this pull request Feb 24, 2026
…n CAPI (iree-org#23442)