Skip to content

Parquet reader v3#82789

Merged
al13n321 merged 59 commits intomasterfrom
pqv3
Aug 24, 2025
Merged

Parquet reader v3#82789
al13n321 merged 59 commits intomasterfrom
pqv3

Conversation

@al13n321
Copy link
Member

@al13n321 al13n321 commented Jun 28, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

  • New parquet reader implementation. It's generally faster and supports page-level filter pushdown and PREWHERE. Currently experimental. Use setting input_format_parquet_use_native_reader_v3 to enable.

A from-scratch implementation of parquet reader. Supports PREWHERE, skipping row groups using min/max statistics and bloom filters, skipping pages using min/max statistics, reading columns in parallel (not just row groups), prefetching using separate thread pool, prefetching only what's required (e.g. start prefetching non-PREWHERE columns only after PREWHERE is done, and we know which pages we'll need to read), no unnecessary copying, etc.

Not very finished and not well tested yet, but should be functional enough to benchmark. On all queries I tried it was faster than #70611

@clickhouse-gh
Copy link
Contributor

clickhouse-gh bot commented Jun 28, 2025

Workflow [PR], commit [ffb51f3]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jun 28, 2025
@al13n321
Copy link
Member Author

al13n321 commented Jul 4, 2025

thrift submodule change: ClickHouse/thrift#2

@taiyang-li
Copy link
Contributor

taiyang-li commented Jul 4, 2025

Great work! Can't wait to use it in Apache Gluten!

Copy link
Member

@rienath rienath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR, can't wait to see the benchmarks! I left some comments, please take a look when you have time. I think most important thing is to have more tests. A lot of cases are not tested and given the size and complexity of this PR, we should anticipate that something will go wrong.

In addition to new tests, I propose to run the old parquet tests with the new reader. Either using Jinja and looping through old/new reader (potentially with different settings) or somehow using randomization

namespace DB::Parquet
{

// TODO [parquet]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of todos are tests. Makes sense to add them in this PR rather than later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok to merge this PR without waiting for me to add tests. The reader is not enabled by default, and I did some amount of testing using the existing stateless tests and manual queries, so it's not super broken.

If I get all tests to work before this PR is accepted, I'll update this PR to enable tests. Up to you whether to wait for that or accept earlier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The team also recommended to do tests in a follow-up PR. Feel free to merge when ready

@al13n321
Copy link
Member Author

Thanks for a thorough review!

@nikitamikhaylov
Copy link
Member

Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, sequential): 01111_create_drop_replicated_db_stress #85774 (comment)
Stateless tests (amd_debug, parallel): 01086_odbc_roundtrip #85973
Stateless tests (arm_binary, parallel) 02435_rollback_cancelled_queries CI db says it is rarely flaky.
Integration tests (amd_binary, 1/5): test_global_overcommit_tracker/test.py::test_global_overcommit #85972

@nikitamikhaylov nikitamikhaylov removed the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 24, 2025
@al13n321 al13n321 added this pull request to the merge queue Aug 24, 2025
Merged via the queue into master with commit 9a64c44 Aug 24, 2025
122 checks passed
@al13n321 al13n321 deleted the pqv3 branch August 24, 2025 13:26
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants