
Refactor FormatParserGroup to split resources management and filter condition logic#83997

Merged
divanik merged 23 commits into master from
divanik/fix_parser_group_usage_with_schema_evolution
Jul 30, 2025
Conversation

@divanik (Member) commented Jul 18, 2025

Changelog category:

Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Splits FormatParserGroup into two independent structs: the first is responsible for shared compute and IO resources, the second for shared filter state (filter ActionsDAG, KeyCondition). This allows these structures to be shared more flexibly by different threads.

@clickhouse-gh bot (Contributor) commented Jul 18, 2025

Workflow [PR], commit [41377be]

Summary:

Stateless tests (amd_asan, distributed plan, sequential): failure
  - 00002_log_and_exception_messages_formatting: FAIL
Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel): failure
  - Logical error thrown (see clickhouse-server.log or logical_errors.txt): FAIL
Stress test (amd_debug): failure
  - Server died: FAIL
  - Hung check failed, possible deadlock found (see hung_check.log): FAIL
  - Killed by signal (in clickhouse-server.log): FAIL
  - Fatal message in clickhouse-server.log (see fatal_messages.txt): FAIL
  - Killed by signal (output files): FAIL
  - Found signal in gdb.log: FAIL
Stress test (amd_ubsan): failure
  - Server died: FAIL
  - Hung check failed, possible deadlock found (see hung_check.log): FAIL
  - Killed by signal (in clickhouse-server.log): FAIL
  - Fatal message in clickhouse-server.log (see fatal_messages.txt): FAIL
  - Killed by signal (output files): FAIL
  - Found signal in gdb.log: FAIL

@divanik divanik requested a review from al13n321 July 18, 2025 17:29
max_block_size,
format_settings,
parser_group,
!schema_was_changed ? parser_group : nullptr,
@al13n321 (Member) Jul 18, 2025
I guess the story here is:

Different files may have different schemas (because they were written before some ALTERs). But the FormatParserGroup thing doesn't support that, it assumes that all files read by one query share things like KeyCondition, PrewhereInfo (soon), and ColumnMapper. So here we use a shared FormatParserGroup only for files with latest schema, while each old-schema file gets its own separate FormatParserGroup with no KeyCondition. This fixes correctness but breaks resource sharing among the file readers (shared thread count limits max_parsing_threads/max_io_threads/parsing_runner/io_runner, soon shared memory limits in ParserGroupExt in #82789 ).

But:

  • Need to assign column_mapper in per-file parser groups?
  • Removing KeyCondition makes ALTER a big performance landmine. E.g. I imagine people often do queries with WHERE timestamp BETWEEN [some short range] and rely on the parquet filter pushdown (using row group min/max stats) to read only a small fraction of all data. Then they do a small ALTER and boom all queries are full scans now, the service is unusable, and there's no (?) way to undo it. I don't know what to do about it. I guess we should add logic for detecting whether a given sequence of ALTERs interferes with a given KeyCondition. E.g. if none of the columns in the KeyCondition were touched by the ALTERs, keep it as is. (And ideally: if only some of the columns used in KeyCondition were touched, remove those columns but leave the rest of the KeyCondition intact.). (... How does MergeTree solve this? oh, it just doesn't allow ALTERing the primary key; we can't do that here.)
  • Would be better to properly support different schemas in FormatParserGroup, so that resource limits are shared correctly. Maybe split FormatParserGroup into two things passed to IInputFormat separately (both by shared_ptr): (1) struct with thread/memory limits and thread pools ("FormatParserSharedResources"?), (2) struct with schema, key condition, prewhere info (after Parquet reader v3 #82789), column_mapper; may make sense to directly use existing struct ReadFromFormatInfo.
  • After Parquet reader v3 #82789 , FormatParserGroup will have prewhere info, which is not optional. If during query analysis we returned true from supportsPrewhere(), the storage must do the prewhere filtering, there's no other pipeline step that will do it. So I guess we should somehow disable prewhere if files have different schemas. It doesn't seem easy to do because currently the supportsPrewhere() check is not tied to a storage snapshot, so the "files have different schemas" condition may change between the check and the read. How does normal MergeTree deal with this? I.e. if you run an alter to swap two columns, and while it's running you do a SELECT with PREWHERE on one of these columns, and only a subset of parts have been mutated, how does this PREWHERE work on the non-mutated parts? How does it know to swap the columns before evaluating the expression? Hopefully we can reuse that in parquet reader v3, somehow? (I guess this is mostly my problem as the author of Parquet reader v3 #82789 , but maybe you already know some answers, that would be helpful.)

(Btw, how does schema evolution work on the read path? Does it add something to the query pipeline to run transformations corresponding to the history of ALTERs? E.g. if two columns' names were swapped, we have to read them from the format, then swap their names in the block - there's some code that does this swapping, right? I wish there were comments explaining it and pointing to that code. How does this interact with KeyCondition stuff? The condition would have to be somehow adjusted to work on pre-ALTER data, e.g. with the two columns swapped? I guess we don't have code for such adjustment and should instead clear the KeyCondition and rely on normal WHERE step for filtering?)

@divanik (Member Author) Jul 22, 2025

Yep, thanks. I intentionally assigned you because you introduced this logic, and I really need your opinion to understand what I could break with this fix. We need to restore correctness as soon as possible and backport the fix, so I need your help.

This fixes correctness but breaks resource sharing among the file readers (shared thread count limits max_parsing_threads/max_io_threads/parsing_runner/io_runner, soon shared memory limits in ParserGroupExt in #82789 ).

Are there any suggestions for how to overcome this?

Need to assign column_mapper in per-file parser groups?

I don't know what column_mapper is; I need to read some code and will comment on this afterwards.

Removing KeyCondition makes ALTER a big performance landmine. E.g. I imagine people often do queries with WHERE timestamp BETWEEN [some short range] and rely on the parquet filter pushdown (using row group min/max stats) to read only a small fraction of all data. Then they do a small ALTER and boom all queries are full scans now, the service is unusable, and there's no (?) way to undo it.

But broken correctness matters more than a performance regression; we can discuss performance in a separate issue. We had full scans in this place before your recent PR with FormatParserGroup, so reintroducing them here is not a tragedy, imho.

I guess we should add logic for detecting whether a given sequence of ALTERs interferes with a given KeyCondition.

For Iceberg this seems doable; let's discuss it in a separate issue.

Would be better to properly support different schemas in FormatParserGroup, so that resource limits are shared correctly. Maybe split FormatParserGroup into two things passed to IInputFormat separately (both by shared_ptr): (1) struct with thread/memory limits and thread pools ("FormatParserSharedResources"?), (2) struct with schema, key condition, prewhere info (after #82789), column_mapper; may make sense to directly use existing struct

This seems good and answers how to unblock the current fix. Can you implement it in the next several days to actually unblock it?

so the "files have different schemas" condition may change between the check and the read

In Iceberg, data files are immutable, and a file's reading schema is tied to the data file from its creation. We do not support writing to Iceberg yet, nor DDL queries. For now you can assume that for each SELECT query we determine the final schema once and apply per-file transformations (to map the data in a file from its original schema to the final one). I don't know in detail how read-write concurrency works in MergeTree, but I think the current Iceberg architecture can more or less provide "atomicity" here: either an ALTER affects all files in a query or none of them.

How does this interact with KeyCondition stuff?

I think there are no optimisations in this code that interact with the key condition for now (at least I didn't know about any before your PR); we relied on the WHERE condition and some partition pruning.

there's some code that does this swapping, right?

Yes, here it is (the chain of logic responsible for it):
https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/ObjectStorage/StorageObjectStorageSource.cpp#L507-L517

std::shared_ptr<const ActionsDAG> IcebergMetadata::getSchemaTransformer(ContextPtr local_context, const String & data_path) const

std::shared_ptr<const ActionsDAG> IcebergSchemaProcessor::getSchemaTransformationDagByIds(Int32 old_id, Int32 new_id)
{
    if (old_id == new_id)
        return nullptr;

    std::lock_guard lock(mutex);
    auto required_transform_dag_it = transform_dags_by_ids.find({old_id, new_id});
    if (required_transform_dag_it != transform_dags_by_ids.end())
        return required_transform_dag_it->second;

    auto old_schema_it = iceberg_table_schemas_by_ids.find(old_id);
    if (old_schema_it == iceberg_table_schemas_by_ids.end())
        throw Exception(ErrorCodes::BAD_ARGUMENTS, "Schema with schema-id {} is unknown", old_id);
    auto new_schema_it = iceberg_table_schemas_by_ids.find(new_id);
    if (new_schema_it == iceberg_table_schemas_by_ids.end())
        throw Exception(ErrorCodes::BAD_ARGUMENTS, "Schema with schema-id {} is unknown", new_id);

    return transform_dags_by_ids[{old_id, new_id}]
        = getSchemaTransformationDag(old_schema_it->second, new_schema_it->second, old_id, new_id);
}

Overall, you raised very good questions. @alesapin, @kssenii, could you take a look at how this stuff can affect Delta Lake correctness, please?

But for now let's create a plan for fixing the correctness problem in this code. I will spend some time understanding what column_mapper is, and after that I will write a plan for how I see the process of resolving this.

@al13n321 al13n321 mentioned this pull request Jul 18, 2025
@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Jul 22, 2025
@divanik (Member Author) commented Jul 22, 2025

The plan:

  1. I can split the threads logic and the schema info in FormatParserGroup myself.
  2. After that I will reimplement the logic of this PR properly, and we will merge and backport it to 25.7.
  3. I think we can simply glue the transformation DAG and the filter DAG together (this won't be backported and can be postponed as a performance improvement).

@divanik (Member Author) commented Jul 22, 2025

Also I can't say that I understand everything here:

After #82789 , FormatParserGroup will have prewhere info, which is not optional. If during query analysis we returned true from supportsPrewhere(), the storage must do the prewhere filtering, there's no other pipeline step that will do it. So I guess we should somehow disable prewhere if files have different schemas. It doesn't seem easy to do because currently the supportsPrewhere() check is not tied to a storage snapshot, so the "files have different schemas" condition may change between the check and the read. How does normal MergeTree deal with this? I.e. if you run an alter to swap two columns, and while it's running you do a SELECT with PREWHERE on one of these columns, and only a subset of parts have been mutated, how does this PREWHERE work on the non-mutated parts? How does it know to swap the columns before evaluating the expression? Hopefully we can reuse that in parquet reader v3, somehow? (I guess this is mostly my problem as the author of #82789 , but maybe you already know some answers, that would be helpful.)

(Btw, how does schema evolution work on the read path? Does it add something to the query pipeline to run transformations corresponding to the history of ALTERs? E.g. if two columns' names were swapped, we have to read them from the format, then swap their names in the block - there's some code that does this swapping, right? I wish there were comments explaining it and pointing to that code. How does this interact with KeyCondition stuff? The condition would have to be somehow adjusted to work on pre-ALTER data, e.g. with the two columns swapped? I guess we don't have code for such adjustment and should instead clear the KeyCondition and rely on normal WHERE step for filtering?)

I think the concurrency problems are orthogonal to the logic in this place (they are handled elsewhere, primarily at the data lake metadata level; here we are talking about data parsing). With PREWHERE I don't have enough context to understand the problem, tbh.

@divanik divanik marked this pull request as ready for review July 23, 2025 14:04
@divanik (Member Author) commented Jul 24, 2025

Sorry, @al13n321, I thought this bug was introduced by your PR about parser groups, but I forgot about this line. It seems the bug was always present, and this fix can indeed lead to severe performance degradation.

@divanik (Member Author) commented Jul 31, 2025

I want to try backporting it to 25.7 if it is not very difficult.

@abashkeev (Member):

Do we need it in 25.7, or can we do it in 25.8 in the regular way?

@divanik (Member Author) commented Jul 31, 2025

@abashkeev, I initially attempted to backport this only because it's a prerequisite PR for a bug fix, and the release was made just ten days ago. However, after considering your question, I've reassessed the situation and determined that backporting this future fix to versions ≤25.7 would be very tedious. Therefore, let's close all backports for now.

Also, Nikita said that we are not going to release 25.7 to cloud, which is another reason to close the backports.

@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jul 31, 2025
baibaichen pushed a commit to Kyligence/ClickHouse that referenced this pull request Aug 7, 2025
Fix rebase issue:
- 20250806 ClickHouse#84821
- 20250804 ClickHouse#83997
- 20250728 ClickHouse#84180
- 20250713 ClickHouse#82949
- 20250703 ClickHouse#82934
- 20250626 ClickHouse#80931
- 20250604 ClickHouse#79649
- 20250502 ClickHouse#79180
- 20250416 ClickHouse#78485
- 20250306 ClickHouse#76662

Co-authored-by: liuneng1994 <[email protected]>
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 7, 2025

Labels

- pr-backports-created: Backport PRs are successfully created, it won't be processed by CI script anymore
- pr-backports-created-cloud: deprecated label, NOOP
- pr-improvement: Pull request with some product improvements
- pr-must-backport: Pull request should be backported intentionally. Use this label with great care!
- pr-must-backport-synced: The `*-must-backport` labels are synced into the cloud Sync PR
- pr-synced-to-cloud: The PR is synced to the cloud repo


6 participants