[SPARK-53272][SQL] Refactor SPJ pushdown logic out of BatchScanExec by chirag-s-db · Pull Request #51979 · apache/spark

chirag-s-db · 2025-08-11T23:30:34Z

What changes were proposed in this pull request?

Storage-partitioned join (SPJ) logic is currently tightly coupled with the DSV2-specific BatchScanExec physical node, making it difficult for connectors to take advantage of SPJ for other types of scans. This PR refactors the SPJ-specific logic out of BatchScanExec and exposes a parameterized base class for connectors to use. The base class requires a partition value accessor, which maps the parameterized type to an InternalRow.
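As a rough illustration of the "parameterized base class plus partition value accessor" idea, here is a minimal, dependency-free sketch. Every name in it (`KeyGroupedScanSketch`, `FileSplit`, `FileSplitScan`, `groupByPartitionValues`) is hypothetical, and `Row` is only a stand-in for Spark's `InternalRow`; it does not reproduce the classes added by this PR.

```scala
// Illustrative sketch only: all names here are hypothetical stand-ins and do
// not reproduce the actual Spark classes touched by this PR.
object PartitionValueAccessorSketch {
  // Stand-in for Spark's InternalRow: an ordered list of partition values.
  type Row = Seq[Any]

  // Base trait parameterized over the connector's own partition type T.
  trait KeyGroupedScanSketch[T] {
    // The accessor a connector must supply: maps a T to its partition values.
    def partitionValueAccessor(partition: T): Row

    // Shared logic written once against the accessor: group input partitions
    // that carry the same partition values.
    def groupByPartitionValues(partitions: Seq[T]): Map[Row, Seq[T]] =
      partitions.groupBy(p => partitionValueAccessor(p))
  }

  // A hypothetical connector-side partition type.
  final case class FileSplit(path: String, day: String, bucket: Int)

  // A hypothetical connector scan that only has to provide the accessor.
  object FileSplitScan extends KeyGroupedScanSketch[FileSplit] {
    override def partitionValueAccessor(p: FileSplit): Row = Seq(p.day, p.bucket)
  }
}
```

The point of the design, under these assumptions, is that the grouping logic never needs to know what `T` is; a connector with its own partition type only supplies the accessor.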

Why are the changes needed?

Allow connectors to take advantage of SPJ on existing scans.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pure refactor - existing tests should suffice.

Was this patch authored or co-authored using generative AI tooling?

No.

rahulsmahadev

LGTM! Looks clean; the only material change is that the logic is now encapsulated in partitionValueAccessor.

cloud-fan · 2025-08-12T12:52:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/KeyGroupedPartitionedScan.scala

Suggested change

- case Some(projectionPositions) => basePartitioning.partitionValues.map{r =>
+ case Some(projectionPositions) => basePartitioning.partitionValues.map { r =>

cloud-fan · 2025-08-12T12:55:06Z

sql/core/src/main/scala/org/apache/spark/sql/execution/KeyGroupedPartitionedScan.scala

Just out of curiosity: what's the relationship between p.expressions and spjParams.keyGroupedPartitioning?

p.expressions already reflects the join key reordering of the expressions, while spjParams.keyGroupedPartitioning contains the partitioning expressions in their original order (which is why they must be reordered here if join key positions are present).
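The reordering described in this reply can be sketched as a projection by position. This is a simplified illustration under assumptions, not the code in KeyGroupedPartitionedScan.scala: the helper name `reorder` and the use of plain strings in place of expressions are hypothetical.

```scala
// Illustrative sketch only: `reorder` is a hypothetical helper, and plain
// values stand in for Spark partitioning expressions.
object JoinKeyReorderSketch {
  // If join key positions are present, pick the original-order items at those
  // positions (in that order); otherwise keep the original ordering.
  def reorder[A](original: Seq[A], joinKeyPositions: Option[Seq[Int]]): Seq[A] =
    joinKeyPositions match {
      case Some(positions) => positions.map(i => original(i))
      case None => original
    }
}
```

For example, with original expressions `Seq("a", "b", "c")` and join key positions `Some(Seq(2, 0))`, the sketch yields `Seq("c", "a")`; with `None` it returns the input unchanged.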

cloud-fan · 2025-08-12T12:55:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/KeyGroupedPartitionedScan.scala

Can we move StoragePartitionJoinParams to its own file instead of keeping it in BatchScanExec.scala?

Sure, done.

chirag-s-db · 2025-08-12T14:39:00Z

Also cc: @sunchao and @szehon-ho for visibility

HyukjinKwon · 2025-08-12T23:48:51Z

Let's file a JIRA and add it to the PR title.

cloud-fan · 2025-08-13T16:01:42Z

@chirag-s-db can you follow the instructions and set up your Github Action? https://github.com/apache/spark/pull/51979/checks?check_run_id=47918325385

chirag-s-db · 2025-08-13T16:14:41Z

> @chirag-s-db can you follow the instructions and set up your Github Action? https://github.com/apache/spark/pull/51979/checks?check_run_id=47918325385

@cloud-fan Checks should be running now, had to rebase on latest master.

sunchao

LGTM, thanks @chirag-s-db !

cloud-fan · 2025-08-14T00:24:59Z

The test job times out, but this refactor shouldn't change anything. Thanks, merging to master!

szehon-ho

Late lgtm, thanks @chirag-s-db !

| Cause | Type | Category | Description | Affected Files |
|-------|------|----------|-------------|----------------|
| - | Feat | Feature | Introduce Spark41Shims and update build configuration to support Spark 4.1. | pom.xml shims/pom.xml shims/spark41/pom.xml shims/spark41/.../META-INF/services/org.apache.gluten.sql.shims.SparkShimProvider shims/spark41/.../spark41/Spark41Shims.scala shims/spark41/.../spark41/SparkShimProvider.scala |
| [#51477](apache/spark#51477) | Fix | Compatibility | Use class name instead of class object for streaming call detection to ensure Spark 4.1 compatibility. | gluten-core/.../caller/CallerInfo.scala |
| [#50852](apache/spark#50852) | Fix | Compatibility | Add printOutputColumns parameter to generateTreeString methods | shims/spark41/.../GenerateTreeStringShim.scala |
| [#51775](apache/spark#51775) | Fix | Compatibility | Remove unused MDC import in FileSourceScanExecShim.scala | shims/spark41/.../FileSourceScanExecShim.scala |
| [#51979](apache/spark#51979) | Fix | Compatibility | Add missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec | shims/spark41/.../v2/AbstractBatchScanExec.scala shims/spark41/.../v2/BatchScanExecShim.scala |
| [#51302](apache/spark#51302) | Fix | Compatibility | Remove TimeAdd from ExpressionConverter and ExpressionMappings for test | gluten-substrait/.../ExpressionConverter.scala gluten-substrait/.../ExpressionMappings.scala |
| [#50598](apache/spark#50598) | Fix | Compatibility | Adapt to QueryExecution.createSparkPlan interface change | gluten-substrait/.../GlutenImplicits.scala shims/spark*/.../QueryExecutionShim.scala |
| [#52599](apache/spark#52599) | Fix | Compatibility | Adapt to DataSourceV2Relation interface change | backends-velox/.../ArrowConvertorRule.scala shims/spark*/.../v2/DataSourceV2RelationShim.scala |
| [#52384](apache/spark#52384) | Fix | Compatibility | Using new interface of ParquetFooterReader | backends-velox/.../ParquetMetadataUtils.scala gluten-ut/spark40/.../parquet/GlutenParquetRowIndexSuite.scala shims/spark*/.../parquet/ParquetFooterReaderShim.scala |
| [#52509](apache/spark#52509) | Fix | Build | Update Scala version to 2.13.17 in pom.xml to fix `java.lang.NoSuchMethodError: 'java.lang.String scala.util.hashing.MurmurHash3$.caseClassHash$default$2()'` | pom.xml |
| - | Fix | Test | Refactor Spark version checks in VeloxHashJoinSuite to improve readability and maintainability | backends-velox/.../VeloxHashJoinSuite.scala |
| [#50849](apache/spark#50849) | Fix | Test | Fix MiscOperatorSuite to support OneRowRelationExec plan Spark 4.1 | backends-velox/.../MiscOperatorSuite.scala |
| [#52723](apache/spark#52723) | Fix | Compatibility | Add GeographyVal and GeometryVal support in ColumnarArrayShim | shims/spark41/.../vectorized/ColumnarArrayShim.java |
| [#48470](apache/spark#48470) | 4.1.0 | Exclude | Exclude split test in VeloxStringFunctionsSuite | backends-velox/.../VeloxStringFunctionsSuite.scala |
| [#51259](apache/spark#51259) | 4.1.0 | Exclude | Only Run ArrowEvalPythonExecSuite tests up to Spark 4.0, we need update ci python to 3.10 | backends-velox/.../python/ArrowEvalPythonExecSuite.scala |

| Cause | Type | Category | Description | Affected Files |
|-------|------|----------|-------------|----------------|
| - | Feat | Feature | Introduce Spark41Shims and update build configuration to support Spark 4.1. | pom.xml shims/pom.xml shims/spark41/pom.xml shims/spark41/.../META-INF/services/org.apache.gluten.sql.shims.SparkShimProvider shims/spark41/.../spark41/Spark41Shims.scala shims/spark41/.../spark41/SparkShimProvider.scala |
| [#51477](apache/spark#51477) | Fix | Compatibility | Use class name instead of class object for streaming call detection to ensure Spark 4.1 compatibility. | gluten-core/.../caller/CallerInfo.scala |
| [#50852](apache/spark#50852) | Fix | Compatibility | Add printOutputColumns parameter to generateTreeString methods | shims/spark41/.../GenerateTreeStringShim.scala |
| [#51775](apache/spark#51775) | Fix | Compatibility | Remove unused MDC import in FileSourceScanExecShim.scala | shims/spark41/.../FileSourceScanExecShim.scala |
| [#51979](apache/spark#51979) | Fix | Compatibility | Add missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec | shims/spark41/.../v2/AbstractBatchScanExec.scala shims/spark41/.../v2/BatchScanExecShim.scala |
| [#51302](apache/spark#51302) | Fix | Compatibility | Remove TimeAdd from ExpressionConverter and ExpressionMappings for test | gluten-substrait/.../ExpressionConverter.scala gluten-substrait/.../ExpressionMappings.scala |
| [#50598](apache/spark#50598) | Fix | Compatibility | Adapt to QueryExecution.createSparkPlan interface change | gluten-substrait/.../GlutenImplicits.scala shims/spark\*/.../shims/spark\*/Spark*Shims.scala |
| [#52599](apache/spark#52599) | Fix | Compatibility | Adapt to DataSourceV2Relation interface change | backends-velox/.../ArrowConvertorRule.scala |
| [#52384](apache/spark#52384) | Fix | Compatibility | Using new interface of ParquetFooterReader | backends-velox/.../ParquetMetadataUtils.scala gluten-ut/spark40/.../parquet/GlutenParquetRowIndexSuite.scala shims/spark*/.../parquet/ParquetFooterReaderShim.scala |
| [#52509](apache/spark#52509) | Fix | Build | Update Scala version to 2.13.17 in pom.xml to fix `java.lang.NoSuchMethodError: 'java.lang.String scala.util.hashing.MurmurHash3$.caseClassHash$default$2()'` | pom.xml |
| - | Fix | Test | Refactor Spark version checks in VeloxHashJoinSuite to improve readability and maintainability | backends-velox/.../VeloxHashJoinSuite.scala |
| [#50849](apache/spark#50849) | Fix | Test | Fix MiscOperatorSuite to support OneRowRelationExec plan Spark 4.1 | backends-velox/.../MiscOperatorSuite.scala |
| [#52723](apache/spark#52723) | Fix | Compatibility | Add GeographyVal and GeometryVal support in ColumnarArrayShim | shims/spark41/.../vectorized/ColumnarArrayShim.java |
| [#48470](apache/spark#48470) | 4.1.0 | Exclude | Exclude split test in VeloxStringFunctionsSuite | backends-velox/.../VeloxStringFunctionsSuite.scala |
| [#51259](apache/spark#51259) | 4.1.0 | Exclude | Only Run ArrowEvalPythonExecSuite tests up to Spark 4.0, we need update ci python to 3.10 | backends-velox/.../python/ArrowEvalPythonExecSuite.scala |

github-actions bot added the SQL label Aug 11, 2025

rahulsmahadev approved these changes Aug 11, 2025

cloud-fan reviewed Aug 12, 2025

chirag-s-db requested a review from cloud-fan August 12, 2025 15:01

chirag-s-db changed the title ~~[SQL] Refactor SPJ pushdown logic out of BatchScanExec~~ [SPARK-53272][SQL] Refactor SPJ pushdown logic out of BatchScanExec Aug 13, 2025

cloud-fan approved these changes Aug 13, 2025

chirag-s-db added 2 commits August 13, 2025 09:07

fixes

ef85b8e

review comments

1f1c551

chirag-s-db force-pushed the refactor-batch-scan branch from 23d337b to 1f1c551 Compare August 13, 2025 16:08

sunchao approved these changes Aug 13, 2025

cloud-fan closed this in 9297712 Aug 14, 2025

szehon-ho reviewed Aug 17, 2025

baibaichen added a commit to apache/gluten that referenced this pull request Dec 31, 2025

[Fix] Add missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec see apache/spark#51979

8f934a3

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[Fix] Add missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec see apache/spark#51979

6609a3a

baibaichen added a commit to baibaichen/gluten that referenced this pull request Jan 4, 2026

[Fix] Add missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec see apache/spark#51979

e0d9bc4

baibaichen added a commit to baibaichen/gluten that referenced this pull request Jan 4, 2026

[Fix] Add missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec see apache/spark#51979

12b4017

baibaichen mentioned this pull request Jan 5, 2026

[GLUTEN-11346][CORE][VL] Add Spark 4.1 Shim Layer apache/gluten#11347

Merged
