Conversation
…k/fix_parser_group_usage_with_schema_evolution
|
Workflow [PR], commit [41377be] Summary: ❌
|
| max_block_size, | ||
| format_settings, | ||
| parser_group, | ||
| !schema_was_changed ? parser_group : nullptr, |
There was a problem hiding this comment.
I guess the story here is:
Different files may have different schemas (because they were written before some ALTERs). But the FormatParserGroup thing doesn't support that, it assumes that all files read by one query share things like KeyCondition, PrewhereInfo (soon), and ColumnMapper. So here we use a shared FormatParserGroup only for files with latest schema, while each old-schema file gets its own separate FormatParserGroup with no KeyCondition. This fixes correctness but breaks resource sharing among the file readers (shared thread count limits max_parsing_threads/max_io_threads/parsing_runner/io_runner, soon shared memory limits in ParserGroupExt in #82789 ).
But:
- Need to assign
column_mapperin per-file parser groups? - Removing KeyCondition makes ALTER a big performance landmine. E.g. I imagine people often do queries with
WHERE timestamp BETWEEN [some short range]and rely on the parquet filter pushdown (using row group min/max stats) to read only a small fraction of all data. Then they do a small ALTER and boom all queries are full scans now, the service is unusable, and there's no (?) way to undo it. I don't know what to do about it. I guess we should add logic for detecting whether a given sequence of ALTERs interferes with a given KeyCondition. E.g. if none of the columns in the KeyCondition were touched by the ALTERs, keep it as is. (And ideally: if only some of the columns used in KeyCondition were touched, remove those columns but leave the rest of the KeyCondition intact.). (... How does MergeTree solve this? oh, it just doesn't allow ALTERing the primary key; we can't do that here.) - Would be better to properly support different schemas in
FormatParserGroup, so that resource limits are shared correctly. Maybe splitFormatParserGroupinto two things passed to IInputFormat separately (both by shared_ptr): (1) struct with thread/memory limits and thread pools ("FormatParserSharedResources"?), (2) struct with schema, key condition, prewhere info (after Parquet reader v3 #82789), column_mapper; may make sense to directly use existing structReadFromFormatInfo. - After Parquet reader v3 #82789 ,
FormatParserGroupwill have prewhere info, which is not optional. If during query analysis we returned true fromsupportsPrewhere(), the storage must do the prewhere filtering, there's no other pipeline step that will do it. So I guess we should somehow disable prewhere if files have different schemas. It doesn't seem easy to do because currently thesupportsPrewhere()check is not tied to a storage snapshot, so the "files have different schemas" condition may change between the check and the read. How does normal MergeTree deal with this? I.e. if you run an alter to swap two columns, and while it's running you do a SELECT with PREWHERE on one of these columns, and only a subset of parts have been mutated, how does this PREWHERE work on the non-mutated parts? How does it know to swap the columns before evaluating the expression? Hopefully we can reuse that in parquet reader v3, somehow? (I guess this is mostly my problem as the author of Parquet reader v3 #82789 , but maybe you already know some answers, that would be helpful.)
(Btw, how does schema evolution work on the read path? Does it add something to the query pipeline to run transformations corresponding to the history of ALTERs? E.g. if two columns' names were swapped, we have to read them from the format, then swap their names in the block - there's some code that does this swapping, right? I wish there were comments explaining it and pointing to that code. How does this interact with KeyCondition stuff? The condition would have to be somehow adjusted to work on pre-ALTER data, e.g. with the two columns swapped? I guess we don't have code for such adjustment and should instead clear the KeyCondition and rely on normal WHERE step for filtering?)
There was a problem hiding this comment.
Yeap, thanks, I intentionally assigned you because you introduced the logic, and I really needed your opinion to understand what I can break by this fix. We need to recover correctness as soon as possible and backport the bug, so I need your help.
This fixes correctness but breaks resource sharing among the file readers (shared thread count limits max_parsing_threads/max_io_threads/parsing_runner/io_runner, soon shared memory limits in ParserGroupExt in #82789 ).
Are there any suggestions how to overcome this?
Need to assign column_mapper in per-file parser groups?
I don't know what the column_mapper is, need to read some code, will comment on this.
Removing KeyCondition makes ALTER a big performance landmine. E.g. I imagine people often do queries with WHERE timestamp BETWEEN [some short range] and rely on the parquet filter pushdown (using row group min/max stats) to read only a small fraction of all data. Then they do a small ALTER and boom all queries are full scans now, the service is unusable, and there's no (?) way to undo it.
But the broken correctness is more important than performance issues, we can discuss performance here in a separate issue. We had full scans in this place before your recent PR with FormatParserGroup, so reintroducing it here is not a tragedy, imho.
I guess we should add logic for detecting whether a given sequence of ALTERs interferes with a given KeyCondition.
For Iceberg seems doable, let's discuss it in a separate issue.
Would be better to properly support different schemas in FormatParserGroup, so that resource limits are shared correctly. Maybe split FormatParserGroup into two things passed to IInputFormat separately (both by shared_ptr): (1) struct with thread/memory limits and thread pools ("FormatParserSharedResources"?), (2) struct with schema, key condition, prewhere info (after #82789), column_mapper; may make sense to directly use existing struct
Seems good and as an answer hot to unblock the current fix, can you implement it in several days to actually unblock it?
so the "files have different schemas" condition may change between the check and the read
In Iceberg, data files are immutable, file reading schema is tightly connected with data file from its creation. We do not support writing to Iceberg yet as well as DDL-queries. You can consider now that for each select query we define final schema of select query once and do transformations per-file (to match data in file from original schema to a final one). I do not properly know how read-write concurrency works in MT, but I think that current Iceberg architecture more or less can provide "atomicity" here. Either alter query will affect all files in query or no files.
How does this interact with KeyCondition stuff?
I think there are no optimisations in this code which interact with key condition for now (at least I didn't know about them before your PR) and we counted on where condition and some partition pruning stuff.
there's some code that does this swapping, right?
Yes, here it is (the chain of logic which is responsible for it):
https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/ObjectStorage/StorageObjectStorageSource.cpp#L507-L517
ClickHouse/src/Storages/ObjectStorage/DataLakes/Iceberg/SchemaProcessor.cpp
Lines 404 to 424 in 8ffceee
Overall, you raised very good questions. @alesapin, @kssenii could you take a look at how this stuff can affect delta lake correctness, please?
But for now let's create a plan, what we can do to remove correctness problem in this code. I will spend some time on understanding what column_mapper is and after that I will write a plan how I see the process of resolving this.
|
The plan:
|
|
Also I can't say that I understand everything here:
I think that concurrency problems are orthogonal to the logic in this place (it is handled at another place and primarily data lake metadata level, here we are talking about data parsing). With PREWHERE I don't have enough context to understand the problem, tbh |
…k/fix_parser_group_usage_with_schema_evolution
…k/fix_parser_group_usage_with_schema_evolution
…k/fix_parser_group_usage_with_schema_evolution
|
Want to try backportint it to 25.7 if it is not very difficult |
|
Do we need it in 25.7 or we can do it in 25.8 in a regular way? |
|
@abashkeev, I initially attempted to backport this only because it's a prerequisite PR for a bug fix, and the release was made just ten days ago. However, after considering your question, I've reassessed the situation and determined that backporting this future fix to versions ≤25.7 would be very tedious. Therefore, let's close all backports for now. Also Nikita said that we are not going to release 25.7 to cloud, this is another reason to close backports |
Fix rebase issue: - 20250806 ClickHouse#84821 - 20250804 ClickHouse#83997 - 20250728 ClickHouse#84180 - 20250713 ClickHouse#82949 - 20250703 ClickHouse#82934 - 20250626 ClickHouse#80931 - 20250604 ClickHouse#79649 - 20250502 ClickHouse#79180 - 20250416 ClickHouse#78485 - 20250306 ClickHouse#76662 Co-authored-by: liuneng1994 <[email protected]>
Fix rebase issue: - 20250806 ClickHouse#84821 - 20250804 ClickHouse#83997 - 20250728 ClickHouse#84180 - 20250713 ClickHouse#82949 - 20250703 ClickHouse#82934 - 20250626 ClickHouse#80931 - 20250604 ClickHouse#79649 - 20250502 ClickHouse#79180 - 20250416 ClickHouse#78485 - 20250306 ClickHouse#76662 Co-authored-by: liuneng1994 <[email protected]>
Fix rebase issue: - 20250806 ClickHouse#84821 - 20250804 ClickHouse#83997 - 20250728 ClickHouse#84180 - 20250713 ClickHouse#82949 - 20250703 ClickHouse#82934 - 20250626 ClickHouse#80931 - 20250604 ClickHouse#79649 - 20250502 ClickHouse#79180 - 20250416 ClickHouse#78485 - 20250306 ClickHouse#76662 Co-authored-by: liuneng1994 <[email protected]>
Fix rebase issue: - 20250806 ClickHouse#84821 - 20250804 ClickHouse#83997 - 20250728 ClickHouse#84180 - 20250713 ClickHouse#82949 - 20250703 ClickHouse#82934 - 20250626 ClickHouse#80931 - 20250604 ClickHouse#79649 - 20250502 ClickHouse#79180 - 20250416 ClickHouse#78485 - 20250306 ClickHouse#76662 Co-authored-by: liuneng1994 <[email protected]>
Fix rebase issue: - 20250806 ClickHouse#84821 - 20250804 ClickHouse#83997 - 20250728 ClickHouse#84180 - 20250713 ClickHouse#82949 - 20250703 ClickHouse#82934 - 20250626 ClickHouse#80931 - 20250604 ClickHouse#79649 - 20250502 ClickHouse#79180 - 20250416 ClickHouse#78485 - 20250306 ClickHouse#76662 Co-authored-by: liuneng1994 <[email protected]>
Fix rebase issue: - 20250806 ClickHouse#84821 - 20250804 ClickHouse#83997 - 20250728 ClickHouse#84180 - 20250713 ClickHouse#82949 - 20250703 ClickHouse#82934 - 20250626 ClickHouse#80931 - 20250604 ClickHouse#79649 - 20250502 ClickHouse#79180 - 20250416 ClickHouse#78485 - 20250306 ClickHouse#76662 Co-authored-by: liuneng1994 <[email protected]>
Fix rebase issue: - 20250806 ClickHouse#84821 - 20250804 ClickHouse#83997 - 20250728 ClickHouse#84180 - 20250713 ClickHouse#82949 - 20250703 ClickHouse#82934 - 20250626 ClickHouse#80931 - 20250604 ClickHouse#79649 - 20250502 ClickHouse#79180 - 20250416 ClickHouse#78485 - 20250306 ClickHouse#76662 Co-authored-by: liuneng1994 <[email protected]>
Fix rebase issue: - 20250806 ClickHouse#84821 - 20250804 ClickHouse#83997 - 20250728 ClickHouse#84180 - 20250713 ClickHouse#82949 - 20250703 ClickHouse#82934 - 20250626 ClickHouse#80931 - 20250604 ClickHouse#79649 - 20250502 ClickHouse#79180 - 20250416 ClickHouse#78485 - 20250306 ClickHouse#76662 Co-authored-by: liuneng1994 <[email protected]>
Fix rebase issue: - 20250806 ClickHouse#84821 - 20250804 ClickHouse#83997 - 20250728 ClickHouse#84180 - 20250713 ClickHouse#82949 - 20250703 ClickHouse#82934 - 20250626 ClickHouse#80931 - 20250604 ClickHouse#79649 - 20250502 ClickHouse#79180 - 20250416 ClickHouse#78485 - 20250306 ClickHouse#76662 Co-authored-by: liuneng1994 <[email protected]>
Changelog category:
Improvement
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Splits FormatParserGroup on two independent structs, the first one is responsible for shared compute and IO resources, the second one is responsible for shared filter resources (filter ActionDag, KeyCondition). This is done for more flexible shared usage of these structures by different threads