
Reapply "[CPU] Support dynamic attention by tiling K1 when needed." #23544

Merged
hanhanW merged 2 commits into iree-org:main from hanhanW:users/hanhanW/fix-dynamic-attention-2 on Feb 23, 2026

Conversation

hanhanW (Contributor) commented Feb 22, 2026

The K1 dimension (head_dim) in attention was unconditionally left untiled, which leads to large stack allocations when the dimension is dynamic.

K1 is typically small (64/128, per the AttentionOpDetail docs), so the original heuristic of leaving it untiled was reasonable. This revision sets the tile sizes when the dimension is dynamic or falls outside the typical range (<= 128).

E2E tests are added with the same inputs and expected outputs as attention.mlir (the static version). Some backends, e.g., AMDGPU, do not support dynamic attention, so we create a new file. The tests are enabled on the CPU and VMVX backends in this revision.

Fixes #23277

Previously, this triggered a bug in the Android build. The root cause is that vectorization is not enabled (because masking is not natively supported), which leads to non-trivial buffer allocation. Forcing the emulation is not ideal, but it enables the functionality. Performance has not been prioritized for at least two years, so we accept the emulation as a workaround; as evidence, the e2e tests have not been enabled for ARM CPU for at least two years.

# TODO(#17751): Add the arm_64 tests when the bug is resolved. See:
# https://github.com/iree-org/iree/actions/runs/10468944505/job/28990909321#step:4:9815
if(IREE_ARCH STREQUAL "arm_64")
return()
endif()

hanhanW requested a review from Groverkss on February 22, 2026
hanhanW requested a review from bjacob as a code owner on February 22, 2026
hanhanW force-pushed the users/hanhanW/fix-dynamic-attention-2 branch from e0a4173 to 4bcd3e2 on February 22, 2026
hanhanW (Contributor, Author) commented Feb 22, 2026

cc @banach-space @egebeysel: this is a pure improvement, as it enables e2e compilation and execution.

hanhanW (Contributor, Author) commented Feb 22, 2026

#23318 removes the workaround, but there are more things to investigate there, so I ended up keeping this workaround for now.

Groverkss (Contributor) commented

I don't have context on the CPU-side changes, so LGTM, since I already approved the attention changes.

hanhanW (Contributor, Author) commented Feb 22, 2026

> I don't have context on the cpu side changes, so LGTM because I already approved the attention changes

Can you help approve the change? Thanks.

hanhanW (Contributor, Author) commented Feb 22, 2026

> I don't have context on the cpu side changes, so LGTM because I already approved the attention changes

For more context, this is not a CPU-specific issue. It is a bufferization issue that arises when the ops cannot be vectorized after conversion to OnlineAttention and lowering to loops, like the old issue: #16956

The bufferization part requires further analysis. We did not fix it in the past; my other PR, #23318, improves the analysis, but that change needs further investigation.

hanhanW merged commit f30a96c into iree-org:main on Feb 23, 2026
52 of 57 checks passed
hanhanW deleted the users/hanhanW/fix-dynamic-attention-2 branch on February 23, 2026 22:27


Development

Successfully merging this pull request may close these issues.

Compiler crash in LLVMCPUSelectLoweringStrategy with dynamic-shape iree_linalg_ext.attention
