Add parquet bloom filters support#62966
Conversation
|
This is an automated comment for commit 424cd1b with description of existing statuses. It's updated for the latest CI running ✅ Click here to open a full report in a separate page Successful checks
|
… implement in and has conditions
al13n321
left a comment
There was a problem hiding this comment.
I'm confused about the if (hashes.size() != data_column->size()) line, and there are a couple remaining optional nitpicks, otherwise good to go.
| * `KeyCondition::RPNElement::FUNCTION_IN_RANGE` becomes: | ||
| * `KeyCondition::RPNElement::FUNCTION_IN_SET` becomes |
There was a problem hiding this comment.
| * `KeyCondition::RPNElement::FUNCTION_IN_RANGE` becomes: | |
| * `KeyCondition::RPNElement::FUNCTION_IN_SET` becomes | |
| * `KeyCondition::RPNElement::FUNCTION_IN_RANGE` or `FUNCTION_IN_SET` becomes: |
| } | ||
| } | ||
|
|
||
| if (hashes.size() != data_column->size()) |
There was a problem hiding this comment.
Wait, how does this not fail when hashes.empty()? How do tests pass?
There was a problem hiding this comment.
I missed this case, sorry. Fixed by returning optional vector. Also added a test that confirmed the LOGICAL_ERROR exception.
al13n321
left a comment
There was a problem hiding this comment.
One more bug, need to check if monotonic_functions_chain is empty.
| * 1. file(..., 'string_column UInt64') where str_column = '50'; Field.type == UInt64. Parquet type string. Not supported. | ||
| * 2. file(...) where str_column = 50; Field.type == String (conversion already taken care by `KeyCondition`). Parquet type string. | ||
| * 3. file(...) where uint32_column = toIPv4(5). Field.type == IPv4. Incompatible column types, resolved by `KeyCondition` itself. | ||
| * 4. file(...) where toIPv4(uint32_column) = toIPv4(5). Field.type == IPv4. We know it is safe to hash it using an int32 API. |
There was a problem hiding this comment.
Oh, KeyCondition accepts this because toIPv4 is a monotonic function, and it puts the function in RPNElement::monotonic_functions_chain.
Wait, but then does it do the same for other functions? E.g. negate(x) = -42 - negate is a monotonic function. I.e. RPNElement will have range = [-42, -42] and monotonic_functions_chain = {negate}. I checked, and this is indeed what happens. So keyConditionRPNToParquetBloomFilterCondition will incorrectly look for value -42 in the bloom filter instead of 42. You have to check if monotonic_functions_chain is empty and do FUNCTION_UNKNOWN if not. Please add a test to reproduce the error first.
(The same happens for x + 10 = 52 - it handles binary function with one const argument.)
(Instead of FUNCTION_UNKNOWN we could try to apply the inverse function using IFunction::getPreimage, but that's not important.)
Fairly interesting. I just tested it and it would indeed be a problem. It is a problem for min/max stats evaluation as well: |
I couldn't reproduce. What's in insert into function file('t.parquet') select 42::Int64 as x settings engine_file_truncate_on_insert=1;
select * from file('t.parquet') where negate(x) = -42 settings input_format_parquet_filter_push_down=1
SELECT *
FROM file('t.parquet')
WHERE (-x) = -42
SETTINGS input_format_parquet_filter_push_down = 1
Query id: 472f9083-ac38-4aa5-9cc4-2191f45639ca
┌──x─┐
1. │ 42 │
└────┘
1 row in set. Elapsed: 0.003 sec.
-- check that index is used at all:
select * from file('t.parquet') where indexHint(negate(x) = -43) settings input_format_parquet_filter_push_down=1
SELECT *
FROM file('t.parquet')
WHERE indexHint((-x) = -43)
SETTINGS input_format_parquet_filter_push_down = 1
Query id: 3f2e7936-ec56-431d-a399-c0e6914257d4
Ok.
0 rows in set. Elapsed: 0.003 sec. Min/max thing shouldn't have this exact problem because it calls KeyCondition::checkInHyperrectangle, which applies the monotonic functions. |
It should be the same as the file in this PR. Maybe I missed something while testing, will double check tomorrow. |
(cherry picked from commit cbbf128e9403b1598cfe62543414d7d2378c33ed)
(cherry picked from commit cbbf128e9403b1598cfe62543414d7d2378c33ed)
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20241016) * Fix Build due to ClickHouse/ClickHouse#70158 * Fix build due to ClickHouse/ClickHouse#62966 * add more info for exception * Fix ut due to ClickHouse/ClickHouse#69481 --------- Co-authored-by: kyligence-git <[email protected]> Co-authored-by: Chang Chen <[email protected]>
Running this query with bloom_filter = false and minmax_filter = true returns inconsistent results |
|
Minimized to: insert into function file('t.parquet') select 12345678901234567890::UInt64 as x union all select 42::UInt64 as x settings engine_file_truncate_on_insert=1;
select * from file('t.parquet') where -x = -42;The problem is that |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add support for parquet bloom filters
Documentation entry for user-facing changes
Modify your CI run:
NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step
Include tests (required builds will be added automatically):
Exclude tests:
Extra options:
Only specified batches in multi-batch jobs: