Conversation
…e elements, idk how to fix it yet, see comment
|
|
|
Great work! Can't wait to use it in Apache Gluten! |
rienath
left a comment
There was a problem hiding this comment.
Great PR, can't wait to see the benchmarks! I left some comments, please take a look when you have time. I think most important thing is to have more tests. A lot of cases are not tested and given the size and complexity of this PR, we should anticipate that something will go wrong.
In addition to new tests, I propose to run the old parquet tests with the new reader. Either using Jinja and looping through old/new reader (potentially with different settings) or somehow using randomization
| namespace DB::Parquet | ||
| { | ||
|
|
||
| // TODO [parquet]: |
There was a problem hiding this comment.
A lot of todos are tests. Makes sense to add them in this PR rather than later
There was a problem hiding this comment.
I think it's ok to merge this PR without waiting for me to add tests. The reader is not enabled by default, and I did some amount of testing using the existing stateless tests and manual queries, so it's not super broken.
If I get all tests to work before this PR is accepted, I'll update this PR to enable tests. Up to you whether to wait for that or accept earlier.
There was a problem hiding this comment.
The team also recommended to do tests in a follow-up PR. Feel free to merge when ready
|
Thanks for a thorough review! |
|
Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, sequential): |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
input_format_parquet_use_native_reader_v3to enable.A from-scratch implementation of parquet reader. Supports PREWHERE, skipping row groups using min/max statistics and bloom filters, skipping pages using min/max statistics, reading columns in parallel (not just row groups), prefetching using separate thread pool, prefetching only what's required (e.g. start prefetching non-PREWHERE columns only after PREWHERE is done, and we know which pages we'll need to read), no unnecessary copying, etc.
Not very finished and not well tested yet, but should be functional enough to benchmark. On all queries I tried it was faster than #70611