
S3 cluster hive pruning#93284

Open
ianton-ru wants to merge 12 commits into ClickHouse:master from ianton-ru:s3_cluster_hive_pruning

Conversation

@ianton-ru
Contributor

@ianton-ru ianton-ru commented Dec 31, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description):

s3Cluster table function optimization with use_hive_partitioning setting.

Documentation entry for user-facing changes

The s3 table function with the use_hive_partitioning setting can skip loading unused objects based on their paths. s3Cluster ignored this setting and loaded all objects.

s3 uses a Filter step to filter out unused files. s3Cluster can't use that step, because the column needed for filtering is not present in the data after loading. This PR adds a new step, ObjectFilter, that restricts which objects are loaded in the first place.
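The pruning idea can be sketched in a few lines of Python (the helpers `hive_partition_values` and `prune_objects` are illustrative names, not part of the PR; ClickHouse implements this in C++ against the object-storage listing):

```python
import re

def hive_partition_values(path):
    """Extract key=value pairs embedded in an object path (Hive layout)."""
    return {k: v for k, v in re.findall(r"([^/=]+)=([^/]+)", path)}

def prune_objects(paths, predicate):
    """Keep only objects whose path-derived partition values satisfy the predicate."""
    return [p for p in paths if predicate(hive_partition_values(p))]

paths = [
    "data/key=1/part-0.parquet",
    "data/key=2/part-0.parquet",
    "data/key=3/part-0.parquet",
]
# WHERE key <= 2: the third object is never listed for download.
kept = prune_objects(paths, lambda parts: int(parts["key"]) <= 2)
```

The point is that the predicate is evaluated against values recovered from the path alone, before any object content is fetched.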

  • Documentation is written (mandatory for new features)

Note

Medium Risk
Touches distributed query planning and filter pushdown for cluster table functions; mistakes could change which objects are scanned or break predicate propagation/serialization across nodes.

Overview
Adds a new query plan step ObjectFilterStep (serializable/registrable) to carry a WHERE predicate specifically for object-path/virtual-column pruning in distributed object-storage reads.

When use_hive_partitioning is enabled, InterpreterSelectQuery now injects ObjectFilterStep on the initiator for ReadFromCluster plans so s3Cluster can skip unrelated Hive-partitioned objects even if the filtering column isn’t present in returned blocks; the primary-key/limit optimization pass is updated to push this step’s predicate down into SourceStepWithFilter.

Refactors ReadFromCluster into IStorageCluster.h and updates cluster filtering to use either locally-added filter DAGs or query_info.filter_actions_dag, plus adjusts object-storage cluster iteration to use getVirtualsList(); adds an integration test validating reduced file reads for s3Cluster with Hive partitioning.
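The pushdown described above can be illustrated with a minimal Python sketch (class and function names here are stand-ins for the C++ ones; the real pass operates on ActionsDAG predicates inside the primary-key/limit optimization):

```python
class SourceStepWithFilter:
    """Stand-in for a plan source step that can accept pushed-down predicates."""
    def __init__(self):
        self.pushed_filters = []

    def push_filter(self, dag):
        self.pushed_filters.append(dag)

class ObjectFilterStep:
    """Carries a WHERE predicate used only for object-path pruning."""
    def __init__(self, dag, child):
        self.dag = dag
        self.child = child

def optimize_push_object_filter(step):
    """Sketch of the optimization pass: if an ObjectFilterStep sits directly
    above a filterable source, hand its predicate to the source so the
    object listing can prune paths before anything is downloaded."""
    if isinstance(step, ObjectFilterStep) and isinstance(step.child, SourceStepWithFilter):
        step.child.push_filter(step.dag)
    return step

source = SourceStepWithFilter()
plan = optimize_push_object_filter(ObjectFilterStep("key <= 2", source))
```

After the pass runs, the source step owns the predicate and can use it while enumerating objects, which is why the filter step itself does not need to touch the data.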

Written by Cursor Bugbot for commit 25cfa06.

@kssenii kssenii self-assigned this Dec 31, 2025
@kssenii kssenii added the can be tested Allows running workflows for external contributors label Dec 31, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Dec 31, 2025

Workflow [PR], commit [ab4a5b6]

Summary:

job_name test_name status info comment
Finish Workflow failure
python3 ./ci/jobs/scripts/workflow_hooks/pr_body_check.py failure

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Dec 31, 2025
@ianton-ru
Contributor Author

Failed tests test_peak_memory_usage and test_backup_restore_on_cluster look unrelated.

@kssenii kssenii requested a review from Copilot January 12, 2026 13:13
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes the s3Cluster table function by implementing object-level filtering when use_hive_partitioning is enabled. Previously, s3Cluster loaded all objects and only filtered data after loading, unlike the regular s3 function which could skip loading unused objects based on partition paths. This change introduces a new ObjectFilterStep that filters objects before loading them, matching the optimization behavior of the non-cluster variant.

Changes:

  • Introduced ObjectFilterStep to filter S3 objects based on hive partitioning before loading
  • Modified ReadFromCluster class to be defined in the header file instead of inline in the implementation
  • Added integration test to verify the optimization works for both s3 and s3Cluster functions

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

File Description
tests/integration/test_s3_cluster/test.py Added comprehensive test verifying hive partitioning optimization reduces file reads for both s3 and s3Cluster
src/Storages/ObjectStorage/StorageObjectStorageCluster.h Removed unused virtual_columns field
src/Storages/ObjectStorage/StorageObjectStorageCluster.cpp Updated to use getVirtualsList() instead of removed virtual_columns field
src/Storages/IStorageCluster.h Moved ReadFromCluster class definition from .cpp to header for broader visibility
src/Storages/IStorageCluster.cpp Removed inline ReadFromCluster class definition (moved to header)
src/Processors/QueryPlan/QueryPlanStepRegistry.cpp Registered new ObjectFilterStep in the query plan step registry
src/Processors/QueryPlan/Optimizations/optimizePrimaryKeyConditionAndLimit.cpp Added handling for ObjectFilterStep in query plan optimization
src/Processors/QueryPlan/ObjectFilterStep.h New header defining ObjectFilterStep for filtering objects before loading
src/Processors/QueryPlan/ObjectFilterStep.cpp Implementation of ObjectFilterStep with serialization support
src/Planner/Planner.cpp Added logic to insert ObjectFilterStep for cluster queries with hive partitioning
src/Interpreters/InterpreterSelectQuery.cpp Added logic to insert ObjectFilterStep for cluster queries with hive partitioning in legacy interpreter

import os
import shutil
import time
import uuid

Copilot AI Jan 12, 2026


The uuid module is imported but the standard library uuid.uuid4() is only used for generating unique query IDs in the test. Consider using the existing random_string helper from helpers.utility (already imported on line 14) for consistency with the codebase pattern, unless UUID format is specifically required for query ID tracking.

Contributor Author


In most integration tests where query_id is used, it is generated from a uuid.


QueryPipelineBuilderPtr ObjectFilterStep::updatePipeline(QueryPipelineBuilders pipelines, const BuildQueryPipelineSettings & /* settings */)
{
return std::move(pipelines.front());

Copilot AI Jan 12, 2026


The updatePipeline method returns the pipeline unchanged without applying the filter transformation. This means the ObjectFilterStep does not actually filter any data in the pipeline execution. The method should create and add a FilterTransform (already included in headers) using actions_dag and filter_column_name to perform the actual filtering.

Suggested change: replace

    return std::move(pipelines.front());

with

    auto pipeline = std::move(pipelines.front());
    pipeline->addSimpleTransform([this](const Block & header)
    {
        return std::make_shared<FilterTransform>(
            header,
            std::make_shared<ActionsDAG>(actions_dag),
            filter_column_name,
            false);
    });
    return pipeline;

Contributor Author


I created ObjectFilterStep exactly because FilterStep can't be used for this purpose.
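A small Python sketch of the constraint behind this decision (all names here are illustrative, not ClickHouse APIs): row-level filtering, which is what FilterStep does, fails when the partition column exists only in the object path, so the predicate must be applied to paths before loading.

```python
# Rows read back from a Hive-partitioned object: the partition column
# ("key") lives only in the object path, not in the returned rows.
rows = [{"value": 10}, {"value": 20}]

def filter_rows(rows, column, predicate):
    """Row-level filtering requires the column to be present in the data."""
    return [r for r in rows if predicate(r[column])]

# Filtering on "key" after loading fails with KeyError, which is why the
# predicate is only useful earlier, applied to the object paths themselves.
try:
    filter_rows(rows, "key", lambda v: v <= 2)
    missing_column = False
except KeyError:
    missing_column = True
```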

&& !query_processing_info.isFirstStage()
&& expression_analysis_result.hasWhere())
{
if (typeid_cast<ReadFromCluster *>(query_plan.getRootNode()->step.get()))

Copilot AI Jan 12, 2026


The logic to insert ObjectFilterStep is duplicated between Planner.cpp and InterpreterSelectQuery.cpp with nearly identical conditions and implementations. Consider extracting this into a shared helper function to reduce duplication and ensure consistent behavior across both query execution paths.

Contributor Author


I'm not sure this makes sense, given the plans to remove the old analyzer.

&& !expressions.first_stage
&& expressions.hasWhere())
{
if (typeid_cast<ReadFromCluster *>(query_plan.getRootNode()->step.get()))

Copilot AI Jan 12, 2026


The logic to insert ObjectFilterStep is duplicated between InterpreterSelectQuery.cpp and Planner.cpp with nearly identical conditions and implementations. Consider extracting this into a shared helper function to reduce duplication and ensure consistent behavior across both query execution paths.

Contributor Author


The same as above.

namespace DB
{

/// Implements WHERE operation.
Member


Please add a short comment here describing the key difference from FilterStep and why/where it is needed.

FROM s3Cluster(cluster_simple, 'http://minio1:9001/{data_path}/key=**.parquet', 'minio', '{minio_secret_key}', 'Parquet', 'key Int32, value Int32')
WHERE key <= 2
FORMAT TSV
SETTINGS enable_filesystem_cache = 0, use_query_cache = 0, use_cache_for_count_from_files = 0, use_hive_partitioning = 1, allow_experimental_analyzer={allow_experimental_analyzer}
Member


Let's also add a test with partition_strategy='hive', as it works a bit differently from the default hive setting.

Contributor Author


As I understand, partition_strategy makes sense only for INSERT, not for SELECT.

Member

@kssenii kssenii Jan 16, 2026


The difference is also that with that setting the hive columns become physical, while without it they are virtual (unless directly specified in the schema).

@ianton-ru
Contributor Author

Integration tests were interrupted by a timeout:

[2026-01-16 16:01:42] !!!!!!! xdist.dsession.Interrupted: session-timeout: 5400.0 sec exceeded !!!!!!!

@ianton-ru
Contributor Author

@kssenii Is anything more required?

@kssenii
Member

kssenii commented Feb 12, 2026

Hi, sorry for the wait! I wanted to verify something before merging and postponed it for too long, thank you for pinging...
I checked, and it looks like with the new analyzer we can simplify this fix a bit: https://pastila.nl/?00acee2b/6be4308e9c92cd4aee7a8c6e0b301c1d#/YzOlYpLBKaA7nyIcM86DQ==GCM. I verified that your test passes with this change.
I remembered about this because I was fixing partition pruning for deltaLakeCluster, which is basically the same as hive pruning, since Delta Lake stores partition values in the hive partitioning style (#82131; that fix was only for the new analyzer).

@ianton-ru
Contributor Author

ianton-ru commented Mar 10, 2026

@kssenii Does this PR need something else from my side?

@kssenii
Member

kssenii commented Mar 10, 2026

Hi! See my comment above: I suggested simplifying this fix and attached a patch (the pastila link) showing my suggestion.

@ianton-ru
Contributor Author

Sorry, I missed it.
The changes have been added.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

@alexbakharew
Contributor

Hi @ianton-ru,
could you please merge master into your branch? It will resolve the LLVM Coverage issue.

@ianton-ru
Contributor Author

@alexbakharew Done.

@clickhouse-gh
Contributor

clickhouse-gh bot commented Mar 17, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 83.70% 83.80% +0.10%
Functions 23.90% 23.90% +0.00%
Branches 76.30% 76.30% +0.00%

PR changed-lines coverage: 82.29% (79/96, 0 noise lines excluded)
