Reapply "[CPU] Support dynamic attention by tiling K1 when needed." #23544
hanhanW merged 2 commits into iree-org:main
…ree-org#23313) This reverts commit acbfa27. Signed-off-by: hanhanW <[email protected]>
Signed-off-by: hanhanW <[email protected]>
e0a4173 to 4bcd3e2
cc @banach-space @egebeysel: this is a pure improvement, as it enables e2e compilation and execution.
#23318 removes the workaround, but there is more to investigate, so I am keeping this workaround for now.
I don't have context on the CPU-side changes, so LGTM, since I already approved the attention changes.
Can you help approve the change? Thanks.
For more context, this is not a CPU-specific issue. It is a bufferization problem that appears when the ops cannot be vectorized after they are converted to OnlineAttention and lowered to loops, like the old issue #16956. The bufferization part requires further analysis. We did not fix it in the past, and my other PR improves the analysis. However, that change needs further investigation: #23318
The K1 dimension (head_dim) in attention was unconditionally left untiled, which leads to a large stack allocation when the dimension is dynamic.
K1 is typically small (64/128 per the AttentionOpDetail docs), so the original heuristic of leaving it untiled was reasonable. The revision sets a tile size for K1 if the dimension is dynamic or if it is not within the typical range (<= 128).
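The decision rule described above can be sketched as follows. This is a minimal illustration, not the code in the PR: the function name `choose_k1_tile_size`, the `default_tile` parameter and its value, and the use of `None` for a dynamic dimension are all hypothetical. A returned tile size of 0 is assumed here to mean "leave the dimension untiled".

```python
from typing import Optional

# Upper end of the typical head_dim range quoted in the description.
TYPICAL_K1_MAX = 128


def choose_k1_tile_size(k1_size: Optional[int], default_tile: int = 32) -> int:
    """Pick a tile size for K1; 0 means 'leave untiled' (assumed convention).

    k1_size is None when the dimension is dynamic (hypothetical encoding).
    default_tile is an illustrative value, not taken from the PR.
    """
    is_dynamic = k1_size is None
    if is_dynamic or k1_size > TYPICAL_K1_MAX:
        # Dynamic or unusually large K1: tile it so the buffer stays bounded.
        return default_tile
    # Small static K1 (e.g. 64 or 128): keep the original untiled behavior.
    return 0
```

With this sketch, static sizes of 64 and 128 stay untiled, while a static size of 256 or a dynamic size gets the illustrative tile size.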
E2E tests are added, and they have the same inputs and expected outputs as attention.mlir (which is a static version). Some backends, e.g., AMDGPU, do not support dynamic attention, so we create a new file. The test is enabled on the CPU and VMVX backends in the revision.
Fixes #23277
Previously, this change triggered a bug on the Android build. The root cause is that vectorization is not enabled (because masking is not natively supported), which leads to non-trivial buffer allocation. Forcing the emulation is not ideal, but it enables the functionality. Performance has not been a priority for at least two years, so we accept the emulation as a workaround. The evidence is that the e2e tests have not been enabled on ARM CPUs for at least two years.
iree/tests/e2e/attention/CMakeLists.txt
Lines 1 to 5 in 243fe33