A new parquet reader that supports filter push down, which improves the total time on clickbench by 50+% compared to arrow parquet reader by liuneng1994 · Pull Request #70611 · ClickHouse/ClickHouse

liuneng1994 · 2024-10-14T05:45:56Z

A parquet reader with native support for filter push down.

Supports most commonly used data types, and supports automatic conversion for derived types
Supports nested type reading
Supports push-down of common expressions
Supports direct calculation and filtering of some expressions on the decompression buffer
Significantly improves performance by delaying materialization of non-conditional columns
Dynamically skips page decompression based on the result of conditional expressions at runtime

Changelog category (leave one):

Experimental Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

A parquet reader with native support for filter push down.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

Allow: All Required Checks
Allow: Stateless tests
Allow: Stateful tests
Allow: Integration Tests
Allow: Performance tests
Allow: All Builds
Allow: batch 1, 2 for multi-batch jobs
Allow: batch 3, 4, 5, 6 for multi-batch jobs

Exclude: Style check
Exclude: Fast test
Exclude: All with ASAN
Exclude: All with TSAN, MSAN, UBSAN, Coverage
Exclude: All with aarch64, release, debug

Run only fuzzers related jobs (libFuzzer fuzzers, AST fuzzers, etc.)
Exclude: AST fuzzers

CLAassistant · 2024-10-14T05:46:04Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ yxheartipp
❌ liuneng1994

liuneng1994 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

al13n321 · 2025-03-10T23:58:53Z

src/Processors/Formats/Impl/Parquet/ParquetColumnReaderFactory.cpp

+/// Create a column reader for a specific physical type and target type.
+/// \tparam physical_type physical type in parquet file
+/// \tparam target_type target type in ClickHouse
+/// \tparam dict is dictionary column or not


I didn't read this in detail yet, but it seems strange to have separate readers for dict and non-dict cases. Column chunk may have some pages encoded with dictionary, then some pages encoded without dictionary (parquet writer falls back to non-dict encoding if the dictionary gets too big). So the dict = true reader must be able to switch to non-dict encoding on the fly. So why not use the same reader for both cases? If dict = false, initialize the reader in a state as if it already switched from dict to non-dict encoding.

al13n321 · 2025-03-11T03:16:15Z

This is probably outside the scope of this PR, but I think there's a better way to do PREWHERE-like prefiltering and input_format_parquet_bloom_filter_push_down. Especially when reading from network (rather than local file), especially with high latency. It seems that the best read strategy would be to read in 4 stages, pipelined as 4 sliding windows of tasks running in parallel (with some dependencies between them):

Filter prefetch: read file byte ranges corresponding to bloom filters and/or the smallest/most-promising filter columns (similar to PREWHERE). Runs on IO threads with very high parallelism (important when reading from network), for lots of row groups at once, because these filters/columns are small and don't take a lot of memory per row group.
Decompress and decode those bloom filters and filter columns. Evaluate filter expressions and produce filter masks. On non-IO threads, with parallelism matching the number of cores. One task per row group is ok, no need to parallelize by individual columns. The resulting filter masks are small (especially if we bit-pack them), so we can afford these tasks to run far ahead of the main data reading, usually all the way to the end of the file (for good read size estimate for the progress bar).
Main data prefetch: read file byte ranges corresponding to the remaining columns. Use filter masks from previous step and page offset index to determine byte ranges to read. That's why this is a separate step from filtering - we want to know tight read ranges before we start reading. Runs on IO threads, with medium parallelism. Can't have very high parallelism because this takes a lot of memory. But higher than decompression+decoding because data is still compressed at this stage. Can read byte ranges within the same row group in parallel, can merge multiple small row groups into one byte range.
Decompress and decode the remaining columns. Runs on non-IO threads, with parallelism matching the number of cores, or limited by memory. Parallelized at the level of primitive columns, not row groups (this is a problem for the current reader that can use at most one thread per row group, but we often don't have enough memory for many row groups) (this is another reason to build filter masks separately - it allows reading columns independently from each other). Composite columns like tuples can be reassembled at the very end before returning the block from ParquetBlockInputFormat.

Each of these would be pipelined across row groups, with a low and high watermark. The dependencies are simple: next stage for a row group can begin when the previous stage completed for that row group (but we won't necessarily begin one row group at a time; for small row groups we'll wait for a few row groups to be ready and schedule them as one task, based a target byte size).

(As a slight generalization, there can be multiple filtering stages: first by the smallest and most promising filter columns, then by other filter columns, then read main data. This hopefully won't be needed.)

(Another consideration is that sometimes when reading filter byte ranges we'll also read some data along the way, if gaps between ranges are small enough that we decided to read instead of seeking. Would be nice to preserve and reuse that read data in later stages instead of reading again. But other times we would rather deallocate it to free up more memory for later stages. Not sure how exactly to handle that.)

Maybe I'll make a native reader v3 at some point :) . This is speculative though, please finish this PR too.

(I previously assumed that this PR would have close to optimal performance and would need at most need incremental optimizations/simplifications. But now that I wrote down the above idea, it sounds very different from this PR, and potentially significantly faster and simpler (though it's likely that I missed something and it won't work). Probably faster to prototype it from scratch than modifying reader v2.)

al13n321 · 2025-03-11T00:07:37Z