Allow to attach partition from table with different partition expression when destination partition expression doesn't re-partition by arthurpassos · Pull Request #39507 · ClickHouse/ClickHouse

arthurpassos · 2022-07-22T17:14:21Z

Algorithm is divided into 2 steps:

Partition compatibility check;
Clone parts and update files;

Partition compatibility check
In order for 2 partition expressions to be considered compatible, they must meet one of the following:

Destination partition expressions must be a subset of source partition expressions; or
Destination partition expression must be monotonically increasing in the source partition min max range.

Monotonicity check does not work for functions with more than 1 variable argument. Therefore, it is a must that all expressions in the partition expression contain at most 1 variable argument.

Clone

Hard link all files as usual.
Delete/ unlink checksums.txt, partition.dat and minmax_<partition_exp_columns>.idx (dst table only).
Recalculate those files and write them back in the destination folder.
Do rest as usual.

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Allow partitions from tables with different partition expressions to be attached when the destination table partition expression doesn't re-partition/ split the part.

Closes #13826

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

…ition

src/Storages/MergeTree/IMergeTreeDataPart.cpp

src/Storages/MergeTree/MergeTreeData.cpp

src/Storages/MergeTree/MergeTreePartition.cpp

src/Storages/StorageMergeTree.cpp

arthurpassos · 2022-07-25T20:09:46Z

Current implementation doesn't cover attaching to non partitioned table because MergeTreeData::checkAlterTableIsPossible fails to call getPartitionIDFromQuery. getPartitionIDFromQuery throws when the number of columns of the current table partition expression differ from the number of columns in the command partition expression.

It looks like adding something like the below in the MergeTreeData::checkAlterTableIsPossible for loop could solve the issue, tho I don't know all the consequences of doing it:

auto attaching_partition_to_non_partitioned_table = command.type == PartitionCommand::ATTACH_PARTITION && !getInMemoryMetadataPtr()->hasPartitionKey();
if (attaching_partition_to_non_partitioned_table)
{
    continue;
}

Plus, when issuing ALTER TABLE ... ATTACH PARTITION ..., command.type is PartitionCommand::REPLACE_PARTITION and not PartitionCommand::ATTACH_PARTITION. It would be great to get an explanation on that.

Attaching to non partitioned table was requested in this issue comment.

UPDATE: getPartitionIDFromQuery is called in another context as well, so bypassing the first call doesn't fully solve the problem. Needs more investigation.

UPDATE: This feature will not be included in this PR.

UPDATE: This feature is now supported :).

tavplubix · 2024-01-14T18:43:20Z

src/Storages/StorageReplicatedMergeTree.cpp

+            if (part_desc->src_table_part->info.partition_id != part_desc->new_part_info.partition_id)
+            {
+                auto src_part = part_desc->src_table_part;
+                const auto & src_storage = src_part->storage;
+
+                auto metadata_manager = std::make_shared<PartMetadataManagerOrdinary>(src_part.get());
+                IMergeTreeDataPart::MinMaxIndex min_max_index;
+
+                min_max_index.load(src_part->storage, metadata_manager);
+
+                MergeTreePartition new_partition;
+
+                new_partition.create(metadata_snapshot, min_max_index.getBlock(src_storage), 0u, getContext());
+
+                /// This will generate unique name in scope of current server process.
+                Int64 temp_index = insert_increment.get();
+
+                auto partition_id = new_partition.getID(*this);
+
+                MergeTreePartInfo dst_part_info(partition_id, temp_index, temp_index, src_part->info.level);
+
+                auto [res_part, temporary_part_lock] = cloneAndLoadPartOnSameDiskWithDifferentPartitionKey(
+                    src_part,
+                    TMP_PREFIX + "clone_",
+                    part_desc->new_part_info,
+                    metadata_snapshot,
+                    new_partition,
+                    min_max_index,
+                    clone_params,
+                    getContext()->getReadSettings(),
+                    getContext()->getWriteSettings());
+
+                part_desc->res_part = std::move(res_part);
+                part_desc->temporary_part_lock = std::move(temporary_part_lock);
+            }
+            else
+            {
+                auto [res_part, temporary_part_lock] = cloneAndLoadDataPartOnSameDisk(
+                    part_desc->src_table_part,
+                    TMP_PREFIX + "clone_",
+                    part_desc->new_part_info,
+                    metadata_snapshot,
+                    clone_params,
+                    getContext()->getReadSettings(),
+                    getContext()->getWriteSettings());
+
+                part_desc->res_part = std::move(res_part);
+                part_desc->temporary_part_lock = std::move(temporary_part_lock);
+            }


It would be nice to avoid copy-paste (looks like it's exactly the same for MergeTree and ReplicatedMergeTree)

@tavplubix

Where do you suggest I put this code? Two new overloads for MergeTreeData::cloneAndLoadDataPartOnSameDisk and MergeTreeData::cloneAndLoadPartOnSameDiskWithDifferentPartitionKey or inside MergeTreeDataPartCloner. I think I would go with the former

I don't have a strong opinion on this

@tavplubix

I have introduced a method in the base class that encapsulates the logic that preceeds cloneAndLoadPartOnSameDiskWithDifferentPartitionKey. This change removed some duplication, but not everything.

The decision of which method to call cloneAndLoadDataPartOnSameDisk or cloneAndLoadPartOnSameDiskWithDifferentPartitionKey still happens on caller site. The main reason for this was to avoid having a boolean parameter that would control internal behavior.

If you consider it a must, I can implement it.

tavplubix · 2024-01-14T18:44:14Z

src/Storages/StorageReplicatedMergeTree.cpp

+            if (is_partition_exp_different)
+            {
+                auto metadata_manager = std::make_shared<PartMetadataManagerOrdinary>(src_part.get());
+                IMergeTreeDataPart::MinMaxIndex min_max_index;
+
+                min_max_index.load(src_data, metadata_manager);
+
+                MergeTreePartition new_partition;
+
+                new_partition.create(metadata_snapshot, min_max_index.getBlock(src_data), 0u, getContext());
+
+                partition_id = new_partition.getID(*this);
+
+                MergeTreePartInfo dst_part_info(partition_id, index, index, src_part->info.level);
+
+                auto [dst_part, part_lock] = cloneAndLoadPartOnSameDiskWithDifferentPartitionKey(
+                    src_part,
+                    TMP_PREFIX,
+                    dst_part_info,
+                    metadata_snapshot,
+                    new_partition,
+                    min_max_index,
+                    clone_params,
+                    query_context->getReadSettings(),
+                    query_context->getWriteSettings());
+
+                dst_parts.emplace_back(dst_part);
+                dst_parts_locks.emplace_back(std::move(part_lock));
+            }
+            else
+            {
+                MergeTreePartInfo dst_part_info(partition_id, index, index, src_part->info.level);
+
+                auto [dst_part, part_lock] = cloneAndLoadDataPartOnSameDisk(
+                    src_part,
+                    TMP_PREFIX,
+                    dst_part_info,
+                    metadata_snapshot,
+                    clone_params,
+                    query_context->getReadSettings(),
+                    query_context->getWriteSettings());
+
+                dst_parts.emplace_back(dst_part);
+                dst_parts_locks.emplace_back(std::move(part_lock));
+            }


tavplubix · 2024-01-14T18:48:37Z

tests/integration/test_attach_partition_distinct_expression_replicated/test.py

+def create_table(node, table_name, replicated):
+    replica = node.name
+    engine = (
+        f"ReplicatedMergeTree('/clickhouse/tables/1/{table_name}', '{replica}')"
+        if replicated
+        else "MergeTree()"


JFYI It's easy to write functional tests with multiple replicas, you can create them like:

create table t1 ... engine=ReplicatedMergeTree('/clickhouse/tables/{database}/{table_name}', '1'); create table t2 ... engine=ReplicatedMergeTree('/clickhouse/tables/{database}/{table_name}', '2');

I did introduce some stateless tests for replicated merge tree: https://github.com/ClickHouse/ClickHouse/pull/39507/files#diff-2285549b3c6a98c7292c0dd01646ea60e35a6af06683f52ee9a57144ca0af100R313.

I don't quite remember the reasoning behind these integration, but I would guess at the time I was trying to validate it also worked on different servers / clusters?

src/Storages/StorageReplicatedMergeTree.cpp

arthurpassos · 2024-01-16T19:24:31Z

@tavplubix I have updated the StorageReplicatedMergeTree code to use the "new" partition id to allocate blocks in zookeeper instead of source partition id. To get the new partition id, MinMaxIndex and MergeTreePartition need to be created beforehand (it used to happen inside cloneAndLoad...).

In order to avoid performing this task twice, I refactored the code so cloneAndLoad... take those as arguments.

tavplubix · 2024-01-17T19:23:44Z

Stateless tests (release, analyzer) - the new test has failed

…h to decide if it is a different partition exp

arthurpassos · 2024-01-20T15:57:24Z

@tavplubix hi. Can you merge it?

tavplubix · 2024-01-22T19:58:15Z

Integration tests (tsan) [4/6] - #59082
Stateful tests (debug, ParallelReplicas) - #59078
Stateless tests (release, analyzer) - #57418
Stateless tests (tsan) [2/5] - #37187 (comment)
Upgrade check (debug) - #58402 (comment)

Slach · 2024-01-23T12:17:25Z

@arthurpassos contrats!

arthurpassos added 11 commits July 21, 2022 12:58

temp commit

2ee716e

temp commit

c913dcf

draft impl for feedback

d633bd1

Merge branch 'ClickHouse:master' into allow_attach_more_granular_part…

a2e9082

…ition

fix weird style changes

b38045d

fix weird style changes

456313b

fix weird style changes

f61eab2

fix weird style changes

7946494

fix weird style changes

70d5b41

aa

b850783

aa

68af6d4

robot-clickhouse added the pr-feature Pull request with new product feature label Jul 22, 2022