Rendezvous hashing filesystem cache in (s3/etc)Cluster functions #82511
Conversation
```python
for obj in minio.list_objects(cluster.minio_bucket, recursive=True):
    print(obj.object_name)
```

```python
try:
    cluster = ClickHouseCluster(__file__)
    # clickhouse0 not a member of cluster_XXX
    for i in range(6):
```
What's the purpose of adding it here in the code vs. doing it via the config?
You mean a config like this one?
I think this loop is simpler than a loop over a list like ["clickhouse0", "clickhouse1", ... , "clickhouse5"].
```python
return int(s3_get_first), int(s3_get_second)
```

```python
def check_s3_gets_repeat(cluster, node, expected_result, cluster_first, cluster_second, enable_filesystem_cache):
```
This test looks too heavy (duration-wise) and a little too complex. We want to test that more or less the same files will be scheduled on each node. So maybe let's just run some queries with different cluster configurations and then check that each replica has ~const * number_of_files / number_of_replicas + eps files in its cache.
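The suggested check could be sketched roughly like this (hypothetical helper; `cache_counts` stands in for however many cached files each replica reports):

```python
def check_even_cache_distribution(cache_counts, number_of_files, eps=0.2):
    # Each replica should hold roughly number_of_files / number_of_replicas
    # cache entries, within a relative tolerance eps.
    expected = number_of_files / len(cache_counts)
    return all(abs(count - expected) <= eps * expected for count in cache_counts)
```

For example, `check_even_cache_distribution([34, 33, 33], 100)` passes, while a badly skewed `[90, 5, 5]` fails.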
It's a complex thing.
The base logic in StorageObjectStorageStableTaskDistributor (which predates this PR; I did not touch this part) is:
- calculate the "primary" replica for each file depending on the file path and the replica info (replica number or replica address).
- but if some replica finishes processing all of its "own" files early, it picks up unprocessed files whose "primary" is another replica.
As a result, files at the head of the list have a near-100% chance of being processed on their "primary" replicas, while files from the tail have a good chance of being caught by "non-primary" replicas. I can't control execution speed, and in a single run these tails can be randomly large, so the test may fail. That's why I wrote the test with several runs and an averaged result, to minimize the chance of this kind of random failure.
A possible way to make the distribution predictable is to remove the ability to process a file on a "non-primary" replica, but that is definitely not for production: if some replica is overloaded or lost, the query would fail with a timeout in that case.
It may be possible to use something like FailPoints (but without the exception) to turn this off for test purposes only, but I'm not sure that is the correct way.
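To make the flakiness concrete, here is a toy simulation (hypothetical code, not the ClickHouse implementation): every file gets a hash-based "primary" replica, but a replica whose own queue is empty steals from the fullest queue, so faster replicas end up processing files that were not "theirs".

```python
import hashlib
import random

def primary(path: str, n_replicas: int) -> int:
    # Toy stand-in for the real hash-based primary selection.
    return int(hashlib.sha256(path.encode()).hexdigest(), 16) % n_replicas

def simulate(n_files: int = 1000, seed: int = 42) -> list[int]:
    weights = [1, 2, 4]          # replica 2 is 4x faster than replica 0
    n_replicas = len(weights)
    rng = random.Random(seed)
    queues = [[] for _ in range(n_replicas)]
    for i in range(n_files):
        queues[primary(f"file_{i}", n_replicas)].append(i)
    processed = [0] * n_replicas
    while any(queues):
        # Faster replicas take files more often; an idle replica steals
        # from the fullest queue instead of failing the query.
        r = rng.choices(range(n_replicas), weights=weights)[0]
        queue = queues[r] if queues[r] else max(queues, key=len)
        if queue:
            queue.pop()
            processed[r] += 1
    return processed
```

With a fixed seed a single run is deterministic, but the per-replica counts drift far from the even primary assignment, which is why the integration test averages several runs.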
Could we test this logic with a unit test?
Please continue this PR, it makes a lot of sense.
@nickitat Sorry, summer vacation time. Added some unit tests in src/Storages/ObjectStorage/tests/gtest_rendezvous_hashing.cpp
Do you think we still need to have tests/integration/test_s3_cache_locality/test.py?
Ok, let's remove it.
```diff
     : iterator(std::move(iterator_))
-    , connection_to_files(number_of_replicas_)
+    , connection_to_files(ids_of_nodes_.size())
+    , ids_of_nodes(ids_of_nodes_)
```
Stateless test 02443_detach_attach_partition failed (fa5747d).
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Rendezvous hashing to improve cache locality.
Documentation entry for user-facing changes
Improvement of #77326.
In the original PR, the "primary" replica for an object is selected based on the node's number in the cluster:

```cpp
ConsistentHashing(sipHash64(file_path), connection_to_files.size());
```

With this logic, the distribution across nodes stays consistent only when a new node gets the maximal number, or when the node with the maximal number is removed.
But a node's number is not tied to the node itself: a new node can be inserted in the middle of the list, and in that case all nodes after it change their numbers.
For example, see Cluster::Cluster(Cluster::ReplicasAsShardsTag, ...).
This PR makes the distribution consistent in that case as well: nodes are selected based on their host:port using the rendezvous hashing algorithm.
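A minimal sketch of the rendezvous (highest-random-weight) idea, using hypothetical names and SHA-256 instead of ClickHouse's hash functions: each file goes to the node whose combined (file, host:port) hash scores highest, so inserting a node anywhere in the list only moves the files that the new node wins.

```python
import hashlib

def rendezvous_pick(path: str, nodes: list[str]) -> str:
    # Score every node by hashing (path, host:port) together; the winner
    # does not depend on the node's position in the list.
    def score(node: str) -> int:
        return int(hashlib.sha256(f"{path}|{node}".encode()).hexdigest(), 16)
    return max(nodes, key=score)

nodes = ["host1:9000", "host2:9000", "host3:9000"]
files = [f"data/part_{i}.parquet" for i in range(100)]
before = {f: rendezvous_pick(f, nodes) for f in files}
# Insert a node "in the middle of the list": existing nodes keep their
# scores, so only files won by the new node change their primary replica.
after = {f: rendezvous_pick(f, nodes[:1] + ["host4:9000"] + nodes[1:]) for f in files}
moved = [f for f in files if before[f] != after[f]]
assert all(after[f] == "host4:9000" for f in moved)
```

This is exactly the stability property that index-based `ConsistentHashing` lacks when node numbering shifts.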