Skip to content

Allow to attach partition from table with different partition expression when destination partition expression doesn't re-partition#39507

Merged
tavplubix merged 170 commits intoClickHouse:masterfrom
arthurpassos:allow_attach_more_granular_partition
Jan 22, 2024
Merged

Allow to attach partition from table with different partition expression when destination partition expression doesn't re-partition#39507
tavplubix merged 170 commits intoClickHouse:masterfrom
arthurpassos:allow_attach_more_granular_partition

Conversation

@arthurpassos
Copy link
Contributor

@arthurpassos arthurpassos commented Jul 22, 2022

Algorithm is divided into 2 steps:

  1. Partition compatibility check;
  2. Clone parts and update files;

Partition compatibility check
In order for 2 partition expressions to be considered compatible, they must meet one of the following:

  1. Destination partition expressions must be a subset of source partition expressions; or
  2. Destination partition expression must be monotonically increasing in the source partition min max range.

Monotonicity check does not work for functions with more than 1 variable argument. Therefore, it is a must that all expressions in the partition expression contain at most 1 variable argument.

Clone

  1. Hard link all files as usual.
  2. Delete/ unlink checksums.txt, partition.dat and minmax_<partition_exp_columns>.idx (dst table only).
  3. Recalculate those files and write them back in the destination folder.
  4. Do rest as usual.

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Allow partitions from tables with different partition expressions to be attached when the destination table partition expression doesn't re-partition/ split the part.

Closes #13826

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@robot-clickhouse robot-clickhouse added the pr-feature Pull request with new product feature label Jul 22, 2022
@arthurpassos
Copy link
Contributor Author

arthurpassos commented Jul 25, 2022

Current implementation doesn't cover attaching to non partitioned table because MergeTreeData::checkAlterTableIsPossible fails to call getPartitionIDFromQuery. getPartitionIDFromQuery throws when the number of columns of the current table partition expression differ from the number of columns in the command partition expression.

It looks like adding something like the below in the MergeTreeData::checkAlterTableIsPossible for loop could solve the issue, tho I don't know all the consequences of doing it:

auto attaching_partition_to_non_partitioned_table = command.type == PartitionCommand::ATTACH_PARTITION && !getInMemoryMetadataPtr()->hasPartitionKey();
if (attaching_partition_to_non_partitioned_table)
{
    continue;
}

Plus, when issuing ALTER TABLE ... ATTACH PARTITION ..., command.type is PartitionCommand::REPLACE_PARTITION and not PartitionCommand::ATTACH_PARTITION. It would be great to get an explanation on that.

Attaching to non partitioned table was requested in this issue comment.

UPDATE: getPartitionIDFromQuery is called in another context as well, so bypassing the first call doesn't fully solve the problem. Needs more investigation.

UPDATE: This feature will not be included in this PR.

UPDATE: This feature is now supported :).

@arthurpassos arthurpassos marked this pull request as ready for review July 26, 2022 13:50
@Felixoid Felixoid added the can be tested Allows running workflows for external contributors label Jul 29, 2022
Comment on lines +2718 to +2766
if (part_desc->src_table_part->info.partition_id != part_desc->new_part_info.partition_id)
{
auto src_part = part_desc->src_table_part;
const auto & src_storage = src_part->storage;

auto metadata_manager = std::make_shared<PartMetadataManagerOrdinary>(src_part.get());
IMergeTreeDataPart::MinMaxIndex min_max_index;

min_max_index.load(src_part->storage, metadata_manager);

MergeTreePartition new_partition;

new_partition.create(metadata_snapshot, min_max_index.getBlock(src_storage), 0u, getContext());

/// This will generate unique name in scope of current server process.
Int64 temp_index = insert_increment.get();

auto partition_id = new_partition.getID(*this);

MergeTreePartInfo dst_part_info(partition_id, temp_index, temp_index, src_part->info.level);

auto [res_part, temporary_part_lock] = cloneAndLoadPartOnSameDiskWithDifferentPartitionKey(
src_part,
TMP_PREFIX + "clone_",
part_desc->new_part_info,
metadata_snapshot,
new_partition,
min_max_index,
clone_params,
getContext()->getReadSettings(),
getContext()->getWriteSettings());

part_desc->res_part = std::move(res_part);
part_desc->temporary_part_lock = std::move(temporary_part_lock);
}
else
{
auto [res_part, temporary_part_lock] = cloneAndLoadDataPartOnSameDisk(
part_desc->src_table_part,
TMP_PREFIX + "clone_",
part_desc->new_part_info,
metadata_snapshot,
clone_params,
getContext()->getReadSettings(),
getContext()->getWriteSettings());

part_desc->res_part = std::move(res_part);
part_desc->temporary_part_lock = std::move(temporary_part_lock);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to avoid copy-paste (looks like it's exactly the same for MergeTree and ReplicatedMergeTree)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tavplubix

Where do you suggest I put this code? Two new overloads for MergeTreeData::cloneAndLoadDataPartOnSameDisk and MergeTreeData::cloneAndLoadPartOnSameDiskWithDifferentPartitionKey or inside MergeTreeDataPartCloner. I think I would go with the former

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion on this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tavplubix

I have introduced a method in the base class that encapsulates the logic that preceeds cloneAndLoadPartOnSameDiskWithDifferentPartitionKey. This change removed some duplication, but not everything.

The decision of which method to call cloneAndLoadDataPartOnSameDisk or cloneAndLoadPartOnSameDiskWithDifferentPartitionKey still happens on caller site. The main reason for this was to avoid having a boolean parameter that would control internal behavior.

If you consider it a must, I can implement it.

Comment on lines +7997 to +8041
if (is_partition_exp_different)
{
auto metadata_manager = std::make_shared<PartMetadataManagerOrdinary>(src_part.get());
IMergeTreeDataPart::MinMaxIndex min_max_index;

min_max_index.load(src_data, metadata_manager);

MergeTreePartition new_partition;

new_partition.create(metadata_snapshot, min_max_index.getBlock(src_data), 0u, getContext());

partition_id = new_partition.getID(*this);

MergeTreePartInfo dst_part_info(partition_id, index, index, src_part->info.level);

auto [dst_part, part_lock] = cloneAndLoadPartOnSameDiskWithDifferentPartitionKey(
src_part,
TMP_PREFIX,
dst_part_info,
metadata_snapshot,
new_partition,
min_max_index,
clone_params,
query_context->getReadSettings(),
query_context->getWriteSettings());

dst_parts.emplace_back(dst_part);
dst_parts_locks.emplace_back(std::move(part_lock));
}
else
{
MergeTreePartInfo dst_part_info(partition_id, index, index, src_part->info.level);

auto [dst_part, part_lock] = cloneAndLoadDataPartOnSameDisk(
src_part,
TMP_PREFIX,
dst_part_info,
metadata_snapshot,
clone_params,
query_context->getReadSettings(),
query_context->getWriteSettings());

dst_parts.emplace_back(dst_part);
dst_parts_locks.emplace_back(std::move(part_lock));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Comment on lines +26 to +31
def create_table(node, table_name, replicated):
replica = node.name
engine = (
f"ReplicatedMergeTree('/clickhouse/tables/1/{table_name}', '{replica}')"
if replicated
else "MergeTree()"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JFYI It's easy to write functional tests with multiple replicas, you can create them like:

create table t1 ... engine=ReplicatedMergeTree('/clickhouse/tables/{database}/{table_name}', '1');
create table t2 ... engine=ReplicatedMergeTree('/clickhouse/tables/{database}/{table_name}', '2');

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did introduce some stateless tests for replicated merge tree: https://github.com/ClickHouse/ClickHouse/pull/39507/files#diff-2285549b3c6a98c7292c0dd01646ea60e35a6af06683f52ee9a57144ca0af100R313.

I don't quite remember the reasoning behind these integration, but I would guess at the time I was trying to validate it also worked on different servers / clusters?

@arthurpassos
Copy link
Contributor Author

@tavplubix I have updated the StorageReplicatedMergeTree code to use the "new" partition id to allocate blocks in zookeeper instead of source partition id. To get the new partition id, MinMaxIndex and MergeTreePartition need to be created beforehand (it used to happen inside cloneAndLoad...).

In order to avoid performing this task twice, I refactored the code so cloneAndLoad... take those as arguments.

@tavplubix
Copy link
Member

Stateless tests (release, analyzer) - the new test has failed

@arthurpassos
Copy link
Contributor Author

@tavplubix hi. Can you merge it?

@tavplubix
Copy link
Member

Integration tests (tsan) [4/6] - #59082
Stateful tests (debug, ParallelReplicas) - #59078
Stateless tests (release, analyzer) - #57418
Stateless tests (tsan) [2/5] - #37187 (comment)
Upgrade check (debug) - #58402 (comment)

@tavplubix tavplubix merged commit 24b8bbe into ClickHouse:master Jan 22, 2024
@Slach
Copy link
Contributor

Slach commented Jan 23, 2024

@arthurpassos contrats!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow to attach a part/partition from a table with more granular partitioning.