[vectorization] Add vectorization of non-projected linalg.generic #23664
hanhanW merged 13 commits into iree-org:main from
Conversation
Signed-off-by: NoumanAmir657 <[email protected]>
hanhanW
left a comment
How is vector_ext.transfer_gather lowered in CPU backend?
```mlir
// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-generic-vectorization{enable-vector-masking=true}))" --split-input-file %s | FileCheck %s -check-prefix=CHECK-MASK
// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-generic-vectorization{fold-cast-into-contract=true}))" --split-input-file %s | FileCheck %s -check-prefix=CHECK-FOLD
// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-generic-vectorization{vectorize-to-transfer-gather=true}))" --split-input-file %s | FileCheck %s --check-prefix=CHECK-GATHER
// RUN: iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-generic-vectorization{enable-vector-masking=true vectorize-to-transfer-gather=true}))" --split-input-file %s | FileCheck %s --check-prefix=CHECK-MASK-GATHER
```
This is okay for now. I will break this into a couple files in a follow-up. Mostly an FYI.
```mlir
// CHECK-GATHER-LABEL: func.func @implicit_gather_like_generic_stride_2
// CHECK-GATHER-SAME:    %[[IN:[a-zA-Z0-9]+]]: tensor<1x1x31xf32>
// CHECK-GATHER-SAME:    %[[OUT:[a-zA-Z0-9]+]]: tensor<1x1x1x1x16xf32>
// CHECK-GATHER-DAG:     %[[C0:.+]] = arith.constant 0 : index
// CHECK-GATHER-DAG:     %[[DENSE:.+]] = arith.constant dense<2> : vector<16xindex>
// CHECK-GATHER-DAG:     %[[STEP:.+]] = vector.step : vector<16xindex>
// CHECK-GATHER:         %[[INDICES:.+]] = arith.muli %[[STEP]], %[[DENSE]] : vector<16xindex>
// CHECK-GATHER:         %[[GATHER:.+]] = iree_vector_ext.transfer_gather %[[IN]][%[[C0]], %[[C0]], %[[C0]]]
// CHECK-GATHER:         %[[RESULT:.+]] = vector.transfer_write %[[GATHER]], %[[OUT]][%[[C0]], %[[C0]], %[[C0]], %[[C0]], %[[C0]]]
// CHECK-GATHER:         return %[[RESULT]]
```
Please make the checks aligned/formatted like the case above.
auto inType = llvm::cast<RankedTensorType>(inOperand->get().getType());
auto outType = llvm::cast<RankedTensorType>(outOperand->get().getType());
style nit: drop the `llvm::` prefix.
rewriter.replaceOp(op, transferWriteOp.getResult());
return success();
Can we move it to the driver? I know the other method has this pattern, but it should be fixed. I'm fixing it in my other vectorization interface implementation: #23713
Can we make it this way in the first place?
Currently, it lowers to a
Can we add a test case for the case that the fastest moving dimension is a gathered load? What is the plan for the lowering of that case? Is it unrolled? (Maybe @Groverkss knows the answers?)
This is the test case:

```mlir
func.func @implicit_gather_like_generic_stride_2(%arg0: tensor<1x1x31xf32>, %arg1: tensor<1x1x1x1x16xf32>) -> tensor<1x1x1x1x16xf32> {
  %0 = linalg.generic {
    indexing_maps = [
      affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d4 * 2)>,
      affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d3, d4)>
    ],
    iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel"]
  } ins(%arg0 : tensor<1x1x31xf32>) outs(%arg1 : tensor<1x1x1x1x16xf32>) {
  ^bb0(%in: f32, %out: f32):
    linalg.yield %in : f32
  } -> tensor<1x1x1x1x16xf32>
  return %0 : tensor<1x1x1x1x16xf32>
}
// CHECK-GATHER-LABEL: func.func @implicit_gather_like_generic_stride_2
// CHECK-GATHER-SAME:    %[[IN:[a-zA-Z0-9]+]]: tensor<1x1x31xf32>
// CHECK-GATHER-SAME:    %[[OUT:[a-zA-Z0-9]+]]: tensor<1x1x1x1x16xf32>
// CHECK-GATHER-DAG:     %[[C0:.+]] = arith.constant 0 : index
// CHECK-GATHER-DAG:     %[[DENSE:.+]] = arith.constant dense<2> : vector<16xindex>
// CHECK-GATHER-DAG:     %[[STEP:.+]] = vector.step : vector<16xindex>
// CHECK-GATHER:         %[[INDICES:.+]] = arith.muli %[[STEP]], %[[DENSE]] : vector<16xindex>
// CHECK-GATHER:         %[[GATHER:.+]] = iree_vector_ext.transfer_gather %[[IN]][%[[C0]], %[[C0]], %[[C0]]]
// CHECK-GATHER:         %[[RESULT:.+]] = vector.transfer_write %[[GATHER]], %[[OUT]][%[[C0]], %[[C0]], %[[C0]], %[[C0]], %[[C0]]]
// CHECK-GATHER:         return %[[RESULT]]
```

Since the generic has stride 2 in the indexing map, after generic vectorization, when the canonicalizer runs, transfer_gather will remain as it is and won't be lowered to a

Edit: Sorry, this test case is already there.
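For reference, the gather-index computation this pattern materializes (a `vector.step` multiplied by a dense stride splat) can be sketched in Python. The function names here are illustrative, not IREE APIs:

```python
def gather_indices(vector_len, stride):
    """Indices produced by scaling a [0, vector_len) step vector by the
    stride, mirroring the arith.muli of vector.step with the dense splat."""
    return [i * stride for i in range(vector_len)]

def gather_read(source, vector_len, stride):
    """Scalar reference for the strided read from the fastest dimension."""
    return [source[i] for i in gather_indices(vector_len, stride)]

# A vector<16> read at stride 2 touches indices 0, 2, ..., 30, which is
# why tensor<1x1x31xf32> is exactly large enough for this gather.
src = list(range(31))
print(gather_read(src, 16, 2))  # [0, 2, 4, ..., 30]
```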
Sorry, it was not obvious to me. It's good if we already have it. Can you check the indexing maps as well? They are important to vectorization, right? (I should run it myself, but I don't have a fresh build yet.)
Signed-off-by: NoumanAmir657 <[email protected]>
@hanhanW I don't think the CI failure is related to this PR. Let me know if further changes are required. Thanks!
SmallVector<AffineExpr> sourceExprs;
for (int i = 0; i < inRank - 1; ++i) {
  sourceExprs.push_back(rewriter.getAffineDimExpr(i));
}
sourceExprs.push_back(rewriter.getAffineSymbolExpr(0));
Source map construction is incorrect for general indexing maps.

VectorizeIREEVectorExtOps.cpp lines 458-462:

```cpp
for (int i = 0; i < inRank - 1; ++i) {
  sourceExprs.push_back(rewriter.getAffineDimExpr(i));
}
sourceExprs.push_back(rewriter.getAffineSymbolExpr(0));
```

This hardcodes the contiguous source dimensions to d0, d1, ..., d_{inRank-2}, but the actual input map's contiguous dims may use different loop dimensions. For example, if the input map is `(d0, d1, d2, d3, d4) -> (d2, d1, d4 * 2)`, this code generates the source map `(d0, d1, d2, d3, d4)[s0] -> (d0, d1, s0)` when it should be `(d0, d1, d2, d3, d4)[s0] -> (d2, d1, s0)`.

The fix: extract the actual dim expressions from the input map's contiguous results rather than assuming they are sequential from d0.
Addressed in the comment below.
AffineMap indexMap = AffineMap::get(
    numLoops, 1, {rewriter.getAffineDimExpr(numLoops - 1)}, ctx);
This assumes the gathered (non-affine) expression always involves the last loop dimension (d4 in the test). But isImplicitGather doesn't enforce this - it only checks that the input map is "not a projected permutation." A map like (d0, d1, d2, d3, d4) -> (d0, d2 * 2, d1) would pass the detection but be incorrectly vectorized because the gathered dim is d2, not d4.
I can constrain it like this:

- Constrain the output to be an identity indexing map.
- Constrain the last dim of the input indexing map to be an AffineBinaryOpExpr or an AffineDimExpr. If `dk` is the last dim of the output indexing map, then the last index of the input indexing map should be `dk * stride + dm` where `dm != dk`, or just `dk`.
- For the above to work, another constraint is needed which checks that all leading dims of input and output are 1, except the last dim. In that case, the gathered expression basically becomes `dk * stride`, and it ensures that we are only gathering/loading from the last dimension.

These checks seem to work on the use case that I am targeting. They will also work on the case where we have multiple gather dim expressions, because of the unit-dim check mentioned in the third point above.

Do you think these constraints are okay to move forward with?
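A toy model of these proposed constraints (this is not the actual C++ `isImplicitGather`; the tuple encoding of affine-map results is a made-up simplification for illustration):

```python
# Toy encoding of affine-map results: ("dim", k) means dk, and
# ("mul_add", k, stride, m) means dk * stride + dm.
def is_implicit_gather(input_results, output_results, in_shape, out_shape):
    num_loops = len(output_results)
    # Constraint 1: output map is the identity (d0, ..., d_{n-1}).
    if output_results != [("dim", i) for i in range(num_loops)]:
        return False
    # Constraint 3: all leading dims of input and output are unit dims.
    if any(d != 1 for d in in_shape[:-1]) or any(d != 1 for d in out_shape[:-1]):
        return False
    # Constraint 2: last input result is dk, or dk * stride + dm with m != k,
    # where k is the last loop dim.
    k = num_loops - 1
    last = input_results[-1]
    if last == ("dim", k):
        return True
    return last[0] == "mul_add" and last[1] == k and last[3] != k

identity = [("dim", i) for i in range(5)]
# (d0, d1, d4 * 2 + d0) with unit leading dims: accepted.
print(is_implicit_gather([("dim", 0), ("dim", 1), ("mul_add", 4, 2, 0)],
                         identity, (1, 1, 31), (1, 1, 1, 1, 16)))  # True
```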
For a case like the one below:

```mlir
// input indexing map
#map1 = affine_map<(d0, d1, d2, d3, d4) -> (d0, d3 * 2 + d1, d4 * 2 + d2)>
// output indexing map
#map2 = affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d3, d4)>
```

Since we constrained that the load/gather will have the form `d4 * stride + d_whatever_except_d4` and that all leading dims up to the last dim are 1 (input `tensor<1x1x31>` and output `tensor<1x1x1x1x16>`), the source map will be `affine_map<(d0, d1, d2, d3, d4)[s0] -> (0, 0, s0)>`. All the leading dims except the last dim of the input will be 0.
```mlir
func.func @implicit_gather_like_generic_stride_2(%arg0: tensor<1x1x31xf32>, %arg1: tensor<1x1x1x1x16xf32>) -> tensor<1x1x1x1x16xf32> {
  %0 = linalg.generic {
    indexing_maps = [
      affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d4 * 2 + d0)>,
      affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d3, d4)>
    ],
    iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel"]
  } ins(%arg0 : tensor<1x1x31xf32>) outs(%arg1 : tensor<1x1x1x1x16xf32>) {
  ^bb0(%in: f32, %out: f32):
    linalg.yield %in : f32
  } -> tensor<1x1x1x1x16xf32>
  return %0 : tensor<1x1x1x1x16xf32>
}
// CHECK-GATHER:       #[[$MAP0:.+]] = affine_map<(d0, d1, d2, d3, d4)[s0] -> (0, 0, s0)>
// CHECK-GATHER:       #[[$MAP1:.+]] = affine_map<(d0, d1, d2, d3, d4)[s0] -> (d4)>
// CHECK-GATHER-LABEL: func.func @implicit_gather_like_generic_stride_2
// CHECK-GATHER-SAME:    %[[IN:[a-zA-Z0-9]+]]: tensor<1x1x31xf32>
// CHECK-GATHER-SAME:    %[[OUT:[a-zA-Z0-9]+]]: tensor<1x1x1x1x16xf32>
// CHECK-GATHER-DAG:     %[[C0:.+]] = arith.constant 0 : index
// CHECK-GATHER-DAG:     %[[DENSE:.+]] = arith.constant dense<2> : vector<16xindex>
// CHECK-GATHER-DAG:     %[[PASSTHRU:.+]] = arith.constant 0.000000e+00 : f32
// CHECK-GATHER-DAG:     %[[STEP:.+]] = vector.step : vector<16xindex>
// CHECK-GATHER:         %[[INDICES:.+]] = arith.muli %[[STEP]], %[[DENSE]] : vector<16xindex>
// CHECK-GATHER:         %[[GATHER:.+]] = iree_vector_ext.transfer_gather %[[IN]][%[[C0]], %[[C0]], %[[C0]]]
// CHECK-GATHER-SAME:      [%[[INDICES]] : vector<16xindex>], %[[PASSTHRU]]
// CHECK-GATHER-SAME:      {indexing_maps = [#[[$MAP0]], #[[$MAP1]]]}
// CHECK-GATHER:         %[[RESULT:.+]] = vector.transfer_write %[[GATHER]], %[[OUT]][%[[C0]], %[[C0]], %[[C0]], %[[C0]], %[[C0]]]
// CHECK-GATHER:         return %[[RESULT]]
```

Future PRs can make isImplicitGather less conservative.
return success(maxFlatVecSize < maxVectorSize);
}

static bool isImplicitGather(linalg::GenericOp genericOp) {
We should be conservative in codegen lowering, i.e., we should prevent incorrect lowering in the first place.
The check and the implementation are not aligned to me, so I asked claude to provide the comment:
isImplicitGather is too permissive for the transformation's assumptions
The detection function (GenericVectorization.cpp lines 148-177) checks:
- All parallel loops
- Exactly 1 input, 1 output
- Input map is NOT a projected permutation
- Output map IS a projected permutation
- Body has only the yield op
But vectorizeImplicitGatherToTransferGather additionally assumes:
- Only the last result expression of the input map is non-trivial
- The contiguous dims are d0..d_{n-2} in sequential order
- The gathered expression involves only d_{numLoops-1}
- The output map is identity (not just projected permutation)
If isImplicitGather returns true but these structural assumptions don't hold, the generated IR is silently incorrect. The detection function must validate the exact structural requirements, or the transformation must analyze the actual map expressions.
auto transferWriteOp = vector::TransferWriteOp::create(
    rewriter, loc, transferGatherOp.getResult(), outOperand->get(),
    writeOffsets);
Similar comment for the mismatch:
The transfer_write uses identity offsets and no permutation map. But isImplicitGather allows any projected permutation for the output map (e.g., (d0,d1,d2,d3,d4) -> (d4,d3,d2,d1,d0)). For non-identity output maps, the write would place elements in incorrect positions. Either restrict isImplicitGather to identity output maps, or generate the correct permutation map for the write.
isImplicitGather restricted to identity output maps.
ArrayRef<int64_t> vectorSizes) {
  Location loc = op.getLoc();
  MLIRContext *ctx = rewriter.getContext();
  rewriter.setInsertionPoint(op);
We should have the insertion point guard when you move the insertion point. IMO, it's callee's responsibility to restore the insertion point. Otherwise, you spread the logic to callers.
rewriter.modifyOpInPlace(
    transferWriteOp, [&] { transferWriteOp.getMaskMutable().assign(mask); });
Why not pass mask to the builder?
I decided to consider only static cases for this PR so masking won't be needed. Fixed in recent commit.
}

Block *body = genericOp.getBlock();
return std::distance(body->begin(), body->end()) == 1;
nit: use llvm::hasSingleElement(*body)
int64_t stride = 1;
inputMap.getResult(inputMap.getNumResults() - 1).walk([&](AffineExpr sub) {
  if (auto mul = dyn_cast<AffineBinaryOpExpr>(sub)) {
    if (mul.getKind() == AffineExprKind::Mul) {
      if (auto rhs = dyn_cast<AffineConstantExpr>(mul.getRHS())) {
        stride = rhs.getValue();
      }
    }
  }
});
This is fragile, and it is wrong for some cases: it treats `d4 * 2 + 1` the same as `d4 * 2`.
Made the isImplicitGather more conservative so the case d4 * 2 + 1 won't be considered for vectorization.
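To illustrate why dropping the additive term matters: a walk that only extracts the stride computes different indices than `d4 * 2 + 1` actually addresses. A small Python sketch (function names are illustrative):

```python
def indices_stride_only(n, stride):
    # What a walk that only extracts the multiplier computes: the +offset
    # of an expression like d4 * 2 + 1 is silently dropped.
    return [i * stride for i in range(n)]

def indices_full_expr(n, stride, offset):
    # What d4 * stride + offset actually indexes.
    return [i * stride + offset for i in range(n)]

print(indices_stride_only(4, 2))   # [0, 2, 4, 6]
print(indices_full_expr(4, 2, 1))  # [1, 3, 5, 7]
```

Every generated index would be off by the dropped offset, which is why rejecting such maps in isImplicitGather is the safe fix.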
Signed-off-by: NoumanAmir657 <[email protected]>
hanhanW
left a comment
Thanks for addressing the comments, here is a new round of review. Overall looks okay to me, just some nits.
I did not expect that d1e7974 would land before this. Can you move the implementation to VectorizableOpInterface.cpp after rebasing? Thanks!
if (dim.getPosition() == lastLoopDim) {
  return cast<AffineConstantExpr>(mul.getLHS()).getValue();
}
}
MLIR's canonical form guarantees the RHS of a Mul is always the constant/symbolic expression, so this is dead code.
/// RHS of mul is always a constant or a symbolic expression.
if (auto dim = dyn_cast<AffineDimExpr>(mul.getRHS())) {
  return dim.getPosition() == lastLoopDim &&
         isa<AffineConstantExpr>(mul.getLHS());
}
same here, this is dead code.
auto inputMap =
    genericOp.getMatchingIndexingMap(genericOp.getDpsInputOperand(0));
auto outputMap =
    genericOp.getMatchingIndexingMap(genericOp.getDpsInitOperand(0));

auto gatherResult =
    IREE::VectorExt::vectorizeImplicitGatherToTransferGather(
        rewriter, genericOp, vectorSizes);

auto *inOperand = op.getDpsInputOperand(0);
auto *outOperand = op.getDpsInitOperand(0);
AffineMap sourceMap =
    AffineMap::get(numLoops, /*symbolCount=*/1, sourceExprs, ctx);

AffineMap indexMap =
    AffineMap::get(numLoops, /*symbolCount=*/1,
                   {rewriter.getAffineDimExpr(numLoops - 1)}, ctx);
nit: these can use auto because ::get spells the type.
llvm_unreachable("isImplicitGather should have rejected this expression");
}

FailureOr<Value>
I think we can always return Value. It should not fail.
Signed-off-by: NoumanAmir657 <[email protected]>
if (isImplicitGather(genericOp)) {
  return true;
}
You should check the vectorizeToTransferGather flag as well.
Signed-off-by: NoumanAmir657 <[email protected]>
https://github.com/iree-org/iree/actions/runs/23066888975/attempts/2?pr=23664 seems to be flaky; it passes in a re-run. Can you fix the bazel issue?
Signed-off-by: NoumanAmir657 <[email protected]>
Fixed in the latest commit.
@hanhanW The CI is passing now. Can you merge this for me? Thanks!
This PR adds support for what was discussed in this RFC.
The RFC proposes the `tensor.extract` approach, but after further discussion with @Groverkss on Discord, it was decided to lower the generic to `iree_vector_ext.transfer_gather`. The e2e lowering won't work for now in the case of a strided load from the fastest moving dimension, because `transfer_gather` lacks the folding pattern for gathering from the fastest moving dimension.
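As a scalar reference for what the vectorized gather must compute, the generic in the tests reduces to `out[d0][d1][d2][d3][d4] = in[d0][d1][d4 * stride]`. A sketch of these semantics in Python (nested lists stand in for tensors; this is not IREE code):

```python
def strided_copy_reference(src, out_shape, stride=2):
    """out[a][b][c][d][e] = src[a][b][e * stride]: the scalar semantics of
    the gather-like generic with input map (d0, d1, d4 * stride)."""
    D0, D1, D2, D3, D4 = out_shape
    return [[[[[src[a][b][e * stride] for e in range(D4)]
               for _ in range(D3)]
              for _ in range(D2)]
             for b in range(D1)]
            for a in range(D0)]

src = [[list(range(31))]]                      # stands in for tensor<1x1x31xf32>
out = strided_copy_reference(src, (1, 1, 1, 1, 16))
print(out[0][0][0][0][:4])                     # [0, 2, 4, 6]
```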