Refactor FormatParserGroup to split resources management and filter condition logic by divanik · Pull Request #83997 · ClickHouse/ClickHouse

divanik · 2025-07-18T16:31:14Z

Changelog category:

Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Splits FormatParserGroup on two independent structs, the first one is responsible for shared compute and IO resources, the second one is responsible for shared filter resources (filter ActionDag, KeyCondition). This is done for more flexible shared usage of these structures by different threads

…k/fix_parser_group_usage_with_schema_evolution

clickhouse-gh · 2025-07-18T16:31:39Z

Workflow [PR], commit [41377be]

Summary: ❌

job_name	test_name	status
Stateless tests (amd_asan, distributed plan, sequential)		failure
	00002_log_and_exception_messages_formatting	FAIL
Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel)		failure
	Logical error thrown (see clickhouse-server.log or logical_errors.txt)	FAIL
Stress test (amd_debug)		failure
	Server died	FAIL
	Hung check failed, possible deadlock found (see hung_check.log)	FAIL
	Killed by signal (in clickhouse-server.log)	FAIL
	Fatal message in clickhouse-server.log (see fatal_messages.txt)	FAIL
	Killed by signal (output files)	FAIL
	Found signal in gdb.log	FAIL
Stress test (amd_ubsan)		failure
	Server died	FAIL
	Hung check failed, possible deadlock found (see hung_check.log)	FAIL
	Killed by signal (in clickhouse-server.log)	FAIL
	Fatal message in clickhouse-server.log (see fatal_messages.txt)	FAIL
	Killed by signal (output files)	FAIL
	Found signal in gdb.log	FAIL

al13n321 · 2025-07-18T19:55:19Z

src/Storages/ObjectStorage/StorageObjectStorageSource.cpp

            max_block_size,
            format_settings,
-            parser_group,
+            !schema_was_changed ? parser_group : nullptr,


I guess the story here is:

Different files may have different schemas (because they were written before some ALTERs). But the FormatParserGroup thing doesn't support that, it assumes that all files read by one query share things like KeyCondition, PrewhereInfo (soon), and ColumnMapper. So here we use a shared FormatParserGroup only for files with latest schema, while each old-schema file gets its own separate FormatParserGroup with no KeyCondition. This fixes correctness but breaks resource sharing among the file readers (shared thread count limits max_parsing_threads/max_io_threads/parsing_runner/io_runner, soon shared memory limits in ParserGroupExt in #82789 ).

But:

Need to assign column_mapper in per-file parser groups?

Removing KeyCondition makes ALTER a big performance landmine. E.g. I imagine people often do queries with WHERE timestamp BETWEEN [some short range] and rely on the parquet filter pushdown (using row group min/max stats) to read only a small fraction of all data. Then they do a small ALTER and boom all queries are full scans now, the service is unusable, and there's no (?) way to undo it. I don't know what to do about it. I guess we should add logic for detecting whether a given sequence of ALTERs interferes with a given KeyCondition. E.g. if none of the columns in the KeyCondition were touched by the ALTERs, keep it as is. (And ideally: if only some of the columns used in KeyCondition were touched, remove those columns but leave the rest of the KeyCondition intact.). (... How does MergeTree solve this? oh, it just doesn't allow ALTERing the primary key; we can't do that here.)

Would be better to properly support different schemas in FormatParserGroup, so that resource limits are shared correctly. Maybe split FormatParserGroup into two things passed to IInputFormat separately (both by shared_ptr): (1) struct with thread/memory limits and thread pools ("FormatParserSharedResources"?), (2) struct with schema, key condition, prewhere info (after Parquet reader v3 #82789), column_mapper; may make sense to directly use existing struct ReadFromFormatInfo.

After Parquet reader v3 #82789 , FormatParserGroup will have prewhere info, which is not optional. If during query analysis we returned true from supportsPrewhere(), the storage must do the prewhere filtering, there's no other pipeline step that will do it. So I guess we should somehow disable prewhere if files have different schemas. It doesn't seem easy to do because currently the supportsPrewhere() check is not tied to a storage snapshot, so the "files have different schemas" condition may change between the check and the read. How does normal MergeTree deal with this? I.e. if you run an alter to swap two columns, and while it's running you do a SELECT with PREWHERE on one of these columns, and only a subset of parts have been mutated, how does this PREWHERE work on the non-mutated parts? How does it know to swap the columns before evaluating the expression? Hopefully we can reuse that in parquet reader v3, somehow? (I guess this is mostly my problem as the author of Parquet reader v3 #82789 , but maybe you already know some answers, that would be helpful.)

(Btw, how does schema evolution work on the read path? Does it add something to the query pipeline to run transformations corresponding to the history of ALTERs? E.g. if two columns' names were swapped, we have to read them from the format, then swap their names in the block - there's some code that does this swapping, right? I wish there were comments explaining it and pointing to that code. How does this interact with KeyCondition stuff? The condition would have to be somehow adjusted to work on pre-ALTER data, e.g. with the two columns swapped? I guess we don't have code for such adjustment and should instead clear the KeyCondition and rely on normal WHERE step for filtering?)

Yeap, thanks, I intentionally assigned you because you introduced the logic, and I really needed your opinion to understand what I can break by this fix. We need to recover correctness as soon as possible and backport the bug, so I need your help.

This fixes correctness but breaks resource sharing among the file readers (shared thread count limits max_parsing_threads/max_io_threads/parsing_runner/io_runner, soon shared memory limits in ParserGroupExt in #82789 ).

Are there any suggestions how to overcome this?

Need to assign column_mapper in per-file parser groups?

I don't know what the column_mapper is, need to read some code, will comment on this.

Removing KeyCondition makes ALTER a big performance landmine. E.g. I imagine people often do queries with WHERE timestamp BETWEEN [some short range] and rely on the parquet filter pushdown (using row group min/max stats) to read only a small fraction of all data. Then they do a small ALTER and boom all queries are full scans now, the service is unusable, and there's no (?) way to undo it.

But the broken correctness is more important than performance issues, we can discuss performance here in a separate issue. We had full scans in this place before your recent PR with FormatParserGroup, so reintroducing it here is not a tragedy, imho.

I guess we should add logic for detecting whether a given sequence of ALTERs interferes with a given KeyCondition.

For Iceberg seems doable, let's discuss it in a separate issue.

Would be better to properly support different schemas in FormatParserGroup, so that resource limits are shared correctly. Maybe split FormatParserGroup into two things passed to IInputFormat separately (both by shared_ptr): (1) struct with thread/memory limits and thread pools ("FormatParserSharedResources"?), (2) struct with schema, key condition, prewhere info (after #82789), column_mapper; may make sense to directly use existing struct

Seems good and as an answer hot to unblock the current fix, can you implement it in several days to actually unblock it?

so the "files have different schemas" condition may change between the check and the read

In Iceberg, data files are immutable, file reading schema is tightly connected with data file from its creation. We do not support writing to Iceberg yet as well as DDL-queries. You can consider now that for each select query we define final schema of select query once and do transformations per-file (to match data in file from original schema to a final one). I do not properly know how read-write concurrency works in MT, but I think that current Iceberg architecture more or less can provide "atomicity" here. Either alter query will affect all files in query or no files.

How does this interact with KeyCondition stuff?
I think there are no optimisations in this code which interact with key condition for now (at least I didn't know about them before your PR) and we counted on where condition and some partition pruning stuff.

there's some code that does this swapping, right?

Yes, here it is (the chain of logic which is responsible for it):
https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/ObjectStorage/StorageObjectStorageSource.cpp#L507-L517

ClickHouse/src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergMetadata.cpp

Line 422 in 8ffceee

std::shared_ptr<const ActionsDAG> IcebergMetadata::getSchemaTransformer(ContextPtr local_context, const String & data_path) const

ClickHouse/src/Storages/ObjectStorage/DataLakes/Iceberg/SchemaProcessor.cpp

Lines 404 to 424 in 8ffceee

std::shared_ptr<const ActionsDAG> IcebergSchemaProcessor::getSchemaTransformationDagByIds(Int32 old_id, Int32 new_id)

{

if (old_id == new_id)

return nullptr;

std::lock_guard lock(mutex);

auto required_transform_dag_it = transform_dags_by_ids.find({old_id, new_id});

if (required_transform_dag_it != transform_dags_by_ids.end())

return required_transform_dag_it->second;

auto old_schema_it = iceberg_table_schemas_by_ids.find(old_id);

if (old_schema_it == iceberg_table_schemas_by_ids.end())

throw Exception(ErrorCodes::BAD_ARGUMENTS, "Schema with schema-id {} is unknown", old_id);

auto new_schema_it = iceberg_table_schemas_by_ids.find(new_id);

if (new_schema_it == iceberg_table_schemas_by_ids.end())

throw Exception(ErrorCodes::BAD_ARGUMENTS, "Schema with schema-id {} is unknown", new_id);

return transform_dags_by_ids[{old_id, new_id}]

= getSchemaTransformationDag(old_schema_it->second, new_schema_it->second, old_id, new_id);

}

Overall, you raised very good questions. @alesapin, @kssenii could you take a look at how this stuff can affect delta lake correctness, please?

But for now let's create a plan, what we can do to remove correctness problem in this code. I will spend some time on understanding what column_mapper is and after that I will write a plan how I see the process of resolving this.

divanik · 2025-07-22T15:18:57Z

The plan:

I can split threads logic and schema info in ParserFormatGroup by myself.
After that I will reimplement the logic of this PR properly and we will merge it and backport to 25.7.
I think we can simply glue transformation dag and filter dag together (this won't be backported and can be postponed as a performance improvement).

divanik · 2025-07-22T15:26:20Z

Also I can't say that I understand everything here:

After #82789 , FormatParserGroup will have prewhere info, which is not optional. If during query analysis we returned true from supportsPrewhere(), the storage must do the prewhere filtering, there's no other pipeline step that will do it. So I guess we should somehow disable prewhere if files have different schemas. It doesn't seem easy to do because currently the supportsPrewhere() check is not tied to a storage snapshot, so the "files have different schemas" condition may change between the check and the read. How does normal MergeTree deal with this? I.e. if you run an alter to swap two columns, and while it's running you do a SELECT with PREWHERE on one of these columns, and only a subset of parts have been mutated, how does this PREWHERE work on the non-mutated parts? How does it know to swap the columns before evaluating the expression? Hopefully we can reuse that in parquet reader v3, somehow? (I guess this is mostly my problem as the author of #82789 , but maybe you already know some answers, that would be helpful.)

(Btw, how does schema evolution work on the read path? Does it add something to the query pipeline to run transformations corresponding to the history of ALTERs? E.g. if two columns' names were swapped, we have to read them from the format, then swap their names in the block - there's some code that does this swapping, right? I wish there were comments explaining it and pointing to that code. How does this interact with KeyCondition stuff? The condition would have to be somehow adjusted to work on pre-ALTER data, e.g. with the two columns swapped? I guess we don't have code for such adjustment and should instead clear the KeyCondition and rely on normal WHERE step for filtering?)

I think that concurrency problems are orthogonal to the logic in this place (it is handled at another place and primarily data lake metadata level, here we are talking about data parsing). With PREWHERE I don't have enough context to understand the problem, tbh

…k/fix_parser_group_usage_with_schema_evolution

divanik · 2025-07-24T17:42:19Z

Sorry, @al13n321, I thought that this bug was introduced by your pr about parser_groups, but I forgot about this line. It seems this bug was always present. It seems that this fix indeed can lead to severe performance degradation.

divanik · 2025-07-31T12:22:44Z

Want to try backportint it to 25.7 if it is not very difficult

abashkeev · 2025-07-31T12:45:20Z

Do we need it in 25.7 or we can do it in 25.8 in a regular way?

divanik · 2025-07-31T14:07:29Z

@abashkeev, I initially attempted to backport this only because it's a prerequisite PR for a bug fix, and the release was made just ten days ago. However, after considering your question, I've reassessed the situation and determined that backporting this future fix to versions ≤25.7 would be very tedious. Therefore, let's close all backports for now.

Also Nikita said that we are not going to release 25.7 to cloud, this is another reason to close backports

Fix rebase issue: - 20250806 ClickHouse#84821 - 20250804 ClickHouse#83997 - 20250728 ClickHouse#84180 - 20250713 ClickHouse#82949 - 20250703 ClickHouse#82934 - 20250626 ClickHouse#80931 - 20250604 ClickHouse#79649 - 20250502 ClickHouse#79180 - 20250416 ClickHouse#78485 - 20250306 ClickHouse#76662 Co-authored-by: liuneng1994 <[email protected]>

divanik added 2 commits July 16, 2025 17:49

Add test (TDD)

5445446

Merge branch 'master' of github.com:ClickHouse/Clickhouse into divani…

9d35033

…k/fix_parser_group_usage_with_schema_evolution

divanik added 3 commits July 18, 2025 16:32

Try to pass settings correctly

1360456

Fix existing iceberg integration tests

78c5c25

Make possible fix

18ba884

divanik requested a review from al13n321 July 18, 2025 17:29

al13n321 reviewed Jul 18, 2025

View reviewed changes

al13n321 mentioned this pull request Jul 18, 2025

Parquet reader v3 #82789

Merged

clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Jul 22, 2025

divanik added 9 commits July 22, 2025 18:34

Save refactoring progress

05547ec

Intermediate commit

0ed5abc

Merge branch 'master' of github.com:ClickHouse/ClickHouse into divani…

d7f583d

…k/fix_parser_group_usage_with_schema_evolution

Built version

b1e9760

Finish refactorring

13eefce

Fix bugs

0d53e06

Merge branch 'master' of github.com:ClickHouse/ClickHouse into divani…

cb29e13

…k/fix_parser_group_usage_with_schema_evolution

Remove nullptr deref

87572f7

Fix null pointer deref, second attempt

16b01f6

divanik assigned al13n321 Jul 23, 2025

divanik requested a review from al13n321 July 23, 2025 13:48

Refactor files

17e1ba6

divanik marked this pull request as ready for review July 23, 2025 14:04

divanik added 4 commits July 23, 2025 16:20

Minor fix style_check

4e6aa6a

Unify naming

6615df0

Remove test fix

f9b808f

Merge branch 'master' of github.com:ClickHouse/ClickHouse into divani…

1fd0d32

…k/fix_parser_group_usage_with_schema_evolution

divanik removed the pr-must-backport-cloud label Jul 31, 2025

robot-ch-test-poll4 added pr-backports-created-cloud deprecated label, NOOP pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR labels Jul 31, 2025

robot-ch-test-poll2 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jul 31, 2025

baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 7, 2025

Fix build due to ClickHouse/ClickHouse#83997

56dad51

baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 8, 2025

Fix build due to ClickHouse/ClickHouse#83997

8c60674

baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 11, 2025

Fix build due to ClickHouse/ClickHouse#83997

9d91270

baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 12, 2025

Fix build due to ClickHouse/ClickHouse#83997

f0c8f55

baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 13, 2025

Fix build due to ClickHouse/ClickHouse#83997

fbd0a51

baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 14, 2025

Fix build due to ClickHouse/ClickHouse#83997

2d4aef2

baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 15, 2025

Fix build due to ClickHouse/ClickHouse#83997

8ba30bc

baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 16, 2025

Fix build due to ClickHouse/ClickHouse#83997

777417c

baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Aug 18, 2025

Fix build due to ClickHouse/ClickHouse#83997

a199bfd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor FormatParserGroup to split resources management and filter condition logic#83997

Refactor FormatParserGroup to split resources management and filter condition logic#83997
divanik merged 23 commits intomasterfrom
divanik/fix_parser_group_usage_with_schema_evolution

divanik commented Jul 18, 2025 •

edited

Loading

Uh oh!

clickhouse-gh bot commented Jul 18, 2025 •

edited

Loading

Uh oh!

al13n321 Jul 18, 2025 •

edited

Loading

Uh oh!

divanik Jul 22, 2025 •

edited

Loading

Uh oh!

divanik commented Jul 22, 2025 •

edited

Loading

Uh oh!

divanik commented Jul 22, 2025

Uh oh!

divanik commented Jul 24, 2025 •

edited

Loading

Uh oh!

divanik commented Jul 31, 2025

Uh oh!

abashkeev commented Jul 31, 2025

Uh oh!

divanik commented Jul 31, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

	std::shared_ptr<const ActionsDAG> IcebergSchemaProcessor::getSchemaTransformationDagByIds(Int32 old_id, Int32 new_id)
	{
	if (old_id == new_id)
	return nullptr;

	std::lock_guard lock(mutex);
	auto required_transform_dag_it = transform_dags_by_ids.find({old_id, new_id});
	if (required_transform_dag_it != transform_dags_by_ids.end())
	return required_transform_dag_it->second;

	auto old_schema_it = iceberg_table_schemas_by_ids.find(old_id);
	if (old_schema_it == iceberg_table_schemas_by_ids.end())
	throw Exception(ErrorCodes::BAD_ARGUMENTS, "Schema with schema-id {} is unknown", old_id);

	auto new_schema_it = iceberg_table_schemas_by_ids.find(new_id);
	if (new_schema_it == iceberg_table_schemas_by_ids.end())
	throw Exception(ErrorCodes::BAD_ARGUMENTS, "Schema with schema-id {} is unknown", new_id);

	return transform_dags_by_ids[{old_id, new_id}]
	= getSchemaTransformationDag(old_schema_it->second, new_schema_it->second, old_id, new_id);
	}

Conversation

divanik commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category:

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Uh oh!

clickhouse-gh bot commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

al13n321 Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

divanik Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

divanik commented Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

divanik commented Jul 22, 2025

Uh oh!

divanik commented Jul 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

divanik commented Jul 31, 2025

Uh oh!

abashkeev commented Jul 31, 2025

Uh oh!

divanik commented Jul 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

divanik commented Jul 18, 2025 •

edited

Loading

clickhouse-gh bot commented Jul 18, 2025 •

edited

Loading

al13n321 Jul 18, 2025 •

edited

Loading

divanik Jul 22, 2025 •

edited

Loading

divanik commented Jul 22, 2025 •

edited

Loading

divanik commented Jul 24, 2025 •

edited

Loading

divanik commented Jul 31, 2025 •

edited

Loading