Change object storage cluster table functions to prefer specific replicas to improve cache locality #77326
Conversation
Thanks for the review @nickitat, planning to address your comments either today or tomorrow.
src/Storages/IStorageCluster.cpp (Outdated)

```cpp
{
    IConnections::ReplicaInfo replica_info{
        .number_of_current_replica = replica_index++,
        .number_of_replicas = number_of_replicas,
```
Not sure about this. I've seen that for parallel replicas we hold `replicas_count` inside the `ParallelReplicasReadingCoordinator`; not sure if it makes sense to use that here.
```
# Conflicts:
#	src/Storages/IStorageCluster.cpp
#	src/Storages/ObjectStorage/StorageObjectStorageCluster.cpp
```
@nickitat Rewritten to use …
Thanks for the amazing work, it looks like exactly what we need!
```diff
      ProfileEvents::increment(ProfileEvents::ReadTaskRequestsReceived);

-     auto response = (*extension->task_iterator)(connections->getLastPacketConnection());
+     auto response = (*extension->task_iterator)(extension->replica_info->number_of_current_replica, extension->replica_info->number_of_replicas);
```
`extension` and `replica_info` are optionals, i.e. they may not be set.
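To illustrate the reviewer's point, here is a standalone sketch (the `Extension`/`ReplicaInfo` shapes below are simplified stand-ins for the real types, and `currentReplicaOrZero` is a hypothetical helper): both optionals must be checked before dereferencing.

```cpp
#include <cstddef>
#include <optional>

// Simplified stand-ins for the types in the diff above.
struct ReplicaInfo
{
    size_t number_of_current_replica = 0;
    size_t number_of_replicas = 0;
};

struct Extension
{
    std::optional<ReplicaInfo> replica_info;
};

// Both the extension and its replica_info can be unset, so each level
// is checked before operator-> is applied.
size_t currentReplicaOrZero(const std::optional<Extension> & extension)
{
    if (extension && extension->replica_info)
        return extension->replica_info->number_of_current_replica;
    return 0; // fall back when the info was never initialized
}
```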
src/Core/Settings.cpp (Outdated)

```cpp
- 1 — `SELECT` returns empty result.
- 0 — `SELECT` throws an exception.
)", 0) \
DECLARE(Bool, object_storage_stable_cluster_task_distribution, false, R"(
```
Let's avoid adding a new setting; hardly anybody will want the old behaviour.
```cpp
std::mutex mutex;
bool iterator_exhausted = false;

LoggerPtr log = getLogger("StorageObjectStorageStableTaskDistributor");
```
Suggested change:

```diff
- LoggerPtr log = getLogger("StorageObjectStorageStableTaskDistributor");
+ LoggerPtr log = getLogger("StorageClusterTaskDistributor");
```
```cpp
return archive_object_info->getPathToArchive();

auto callback = std::make_shared<TaskIterator>(
    [task_distributor](size_t number_of_current_replica, size_t number_of_replicas) mutable -> String {
        if (auto next_task = task_distributor->getNextTask(number_of_current_replica, number_of_replicas))
```
a more concise way would be: `return foo(...).value_or("");`

```cpp
std::shared_ptr<IObjectIterator> iterator_)
    : iterator(std::move(iterator_))
    , iterator_exhausted(false)
    , log(getLogger("StorageObjectStorageStableTaskDistributor"))
```
Please leave only one initialization; either here or in the header.
src/Client/IConnections.h (Outdated)

```cpp
struct ReplicaInfo
{
    size_t number_of_current_replica{0};
    size_t number_of_replicas{0};
```
If we add a new field, we have to make sure it is either optional (and that is at least stated in a comment) or (preferably) initialized in all places.
But it doesn't need to be there at all, because the total number of replicas doesn't depend on the specific replica making a request. It is a constant that can be provided once to `StorageObjectStorageStableTaskDistributor`'s constructor.
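The suggestion can be sketched like this (the class and member names below are assumptions for illustration, not the PR's final signature): the replica count is fixed for the whole query, so it is passed once at construction rather than with every task request.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of a distributor that receives the replica count once.
class StableTaskDistributorSketch
{
public:
    explicit StableTaskDistributorSketch(size_t number_of_replicas_)
        : connection_to_files(number_of_replicas_) // one file queue per replica
        , number_of_replicas(number_of_replicas_)
    {
    }

    size_t replicaCount() const { return number_of_replicas; }

private:
    std::vector<std::vector<std::string>> connection_to_files;
    size_t number_of_replicas;
};
```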
```cpp
String next_file = files.back();
files.pop_back();

if (!unprocessed_files.contains(next_file))
```
A little more performant alternative is `find` + `erase(it)`.
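The reason: `contains(key)` followed by `erase(key)` performs two hash lookups, while `find` followed by `erase(iterator)` performs one. A standalone sketch (the set type and helper name are assumptions for illustration):

```cpp
#include <string>
#include <unordered_set>

// Returns true if next_file was still unprocessed, removing it in that case.
// A single find() locates the element; erase(it) then removes it without a
// second hash lookup, unlike contains(key) followed by erase(key).
bool takeIfUnprocessed(std::unordered_set<std::string> & unprocessed_files, const std::string & next_file)
{
    auto it = unprocessed_files.find(next_file);
    if (it == unprocessed_files.end())
        return false;
    unprocessed_files.erase(it);
    return true;
}
```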
```cpp
std::shared_ptr<IObjectIterator> iterator;

std::unordered_map<size_t, std::vector<String>> connection_to_files;
```
Since replica ids should always lie in the range [0; N), it could be replaced with `std::vector`.
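With dense keys in [0, N), a vector sized once at construction gives the same per-replica lookup without hashing. A minimal sketch (the alias and helper below are hypothetical):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Replica ids are dense in [0, N), so a vector indexed by replica id can
// replace std::unordered_map<size_t, std::vector<String>>.
using FileQueues = std::vector<std::vector<std::string>>;

std::vector<std::string> & filesForReplica(FileQueues & queues, size_t replica_id)
{
    assert(replica_id < queues.size()); // ids must stay in range
    return queues[replica_id];
}
```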
```cpp
std::optional<String> StorageObjectStorageStableTaskDistributor::getMatchingFileFromIterator(
    size_t number_of_current_replica, size_t number_of_replicas)
{
    while (!iterator_exhausted)
```
This access to `iterator_exhausted` should probably also be guarded by the lock.
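The point, sketched in isolation (member names follow the snippets above, the wrapper struct is hypothetical): a flag written under a mutex must be read under the same mutex, otherwise the read is a data race; making the flag `std::atomic<bool>` would be an alternative.

```cpp
#include <mutex>

// Sketch of the guarded access the review asks for: iterator_exhausted is
// written under `mutex` elsewhere, so the loop condition should take the
// same lock before reading it.
struct DistributorStateSketch
{
    std::mutex mutex;
    bool iterator_exhausted = false;

    bool exhausted()
    {
        std::lock_guard lock(mutex);
        return iterator_exhausted;
    }

    void markExhausted()
    {
        std::lock_guard lock(mutex);
        iterator_exhausted = true;
    }
};
```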
Initialize task distributor with number of replicas directly, instead of passing it through replica info
Thanks @nickitat and sorry for the delay here.
```cpp
auto max_replicas_to_use = static_cast<UInt64>(cluster->getShardsInfo().size());
if (context->getSettingsRef()[Setting::max_parallel_replicas] > 1)
    max_replicas_to_use = std::min(max_replicas_to_use, context->getSettingsRef()[Setting::max_parallel_replicas].value);

createExtension(predicate, max_replicas_to_use);
```
One thing I wasn't quite sure about is how the initialization works here in `applyFilters`; not sure what the code path to get here is.
I copied parts from below to get the same number of replicas, so hopefully that will work OK.
Looks like two tests failed on the "Replica info is not initialized" check; I will have a look at those tomorrow.
Initialized replica info on all extensions that also use the task iterator, even though it is not currently used for most of them. Not sure if this is going too deep in this PR.
Thanks again for the review @nickitat! Could you please take another look?
```
# Conflicts:
#	src/Storages/IStorageCluster.cpp
#	src/Storages/StorageDistributed.cpp
```
477887a
Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Object storage cluster table functions (e.g. `s3Cluster`) will now assign files to replicas for reading based on a consistent hash to improve cache locality.

Documentation entry for user-facing changes
Addresses: #72816
This works by assigning files from the file iterator to specific replicas.
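As an illustration of the idea only (a standalone sketch, not the PR's actual code; `preferredReplica` is a hypothetical name), a rendezvous-style hash keeps the file-to-replica mapping stable: each file deterministically prefers the same replica across queries, which is what makes the filesystem cache warm for repeated reads.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Rendezvous (highest-random-weight) hashing sketch: for a given file, score
// every replica by hashing (file, replica) together and pick the highest
// score. The winner is stable for a fixed replica count, so the same replica
// keeps reading (and caching) the same files.
size_t preferredReplica(const std::string & file_path, size_t number_of_replicas)
{
    size_t best_replica = 0;
    size_t best_weight = 0;
    for (size_t replica = 0; replica < number_of_replicas; ++replica)
    {
        size_t weight = std::hash<std::string>{}(file_path + "#" + std::to_string(replica));
        if (weight >= best_weight)
        {
            best_weight = weight;
            best_replica = replica;
        }
    }
    return best_replica;
}
```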
This PR also implements task stealing, so that if a replica is unavailable or slow, other replicas will process its tasks. For filesystem cache locality, though, this means some task stealing at the tail of iteration is inevitable, preventing us from achieving perfect cache locality.
I've added a 50ms sleep before stealing is allowed to proceed, which helps a bit, but I'm not sure if that's a good approach.

Added an `object_storage_stable_cluster_task_distribution` setting to enable the new behaviour.

I'm happy to write some tests if I get a 👍 on the approach.