Parquet reader v3 by al13n321 · Pull Request #82789 · ClickHouse/ClickHouse

al13n321 · 2025-06-28T02:02:20Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

New Parquet reader implementation. It's generally faster and supports page-level filter pushdown and PREWHERE. Currently experimental. Use the setting input_format_parquet_use_native_reader_v3 to enable it.

A from-scratch implementation of the Parquet reader. It supports PREWHERE, skipping row groups using min/max statistics and bloom filters, skipping pages using min/max statistics, reading columns in parallel (not just row groups), prefetching using a separate thread pool, prefetching only what's required (e.g. prefetching of non-PREWHERE columns starts only after PREWHERE is done and we know which pages we'll need to read), no unnecessary copying, etc.

Not fully finished and not well tested yet, but it should be functional enough to benchmark. On all queries I tried, it was faster than #70611.
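
For readers unfamiliar with min/max-based skipping, here is a minimal sketch of the general idea, not the code from this PR. It assumes a single integer column and an inclusive range predicate. The names `PageStats`, `RangeFilter`, `canSkipPage` and `pagesToRead` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only; not the structs used in this PR.
struct PageStats
{
    int64_t min_value; // smallest value stored in the page
    int64_t max_value; // largest value stored in the page
};

struct RangeFilter
{
    int64_t lo; // inclusive lower bound of the predicate
    int64_t hi; // inclusive upper bound of the predicate
};

// A page can be skipped when its [min, max] interval does not intersect [lo, hi]:
// no row in the page can possibly satisfy the predicate.
inline bool canSkipPage(const PageStats & stats, const RangeFilter & filter)
{
    return stats.max_value < filter.lo || stats.min_value > filter.hi;
}

// Returns the indices of the pages that still have to be read and decoded.
inline std::vector<size_t> pagesToRead(const std::vector<PageStats> & pages, const RangeFilter & filter)
{
    std::vector<size_t> result;
    for (size_t i = 0; i < pages.size(); ++i)
        if (!canSkipPage(pages[i], filter))
            result.push_back(i);
    return result;
}
```

The same interval test applies at row-group granularity. A page that survives the check may still contain no matching rows, so the filter has to be evaluated on the decoded values afterwards.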

clickhouse-gh · 2025-06-28T02:02:46Z

Workflow [PR], commit [ffb51f3]

Summary: ✅

al13n321 · 2025-07-04T02:55:41Z

thrift submodule change: ClickHouse/thrift#2

taiyang-li · 2025-07-04T02:58:11Z

Great work! Can't wait to use it in Apache Gluten!

rienath

Great PR, can't wait to see the benchmarks! I left some comments, please take a look when you have time. I think the most important thing is to have more tests. A lot of cases are not tested, and given the size and complexity of this PR, we should anticipate that something will go wrong.

In addition to new tests, I propose running the old Parquet tests with the new reader, either using Jinja and looping through the old/new reader (potentially with different settings) or somehow using randomization.

tests/queries/0_stateless/02302_defaults_in_columnar_formats.sql

rienath · 2025-07-16T11:15:58Z

src/Processors/Formats/Impl/Parquet/Reader.h

+namespace DB::Parquet
+{
+
+// TODO [parquet]:


A lot of the TODOs are tests. It makes sense to add them in this PR rather than later.

al13n321 · Aug 22, 2025

I think it's ok to merge this PR without waiting for me to add tests. The reader is not enabled by default, and I did some amount of testing using the existing stateless tests and manual queries, so it's not super broken.

If I get all tests to work before this PR is accepted, I'll update this PR to enable tests. Up to you whether to wait for that or accept earlier.

rienath · Aug 22, 2025

The team also recommended doing the tests in a follow-up PR. Feel free to merge when ready.

src/Processors/Formats/Impl/Parquet/Reader.h

src/Common/threadPoolCallbackRunner.cpp

src/Core/SettingsChangesHistory.cpp

src/Formats/FormatSettings.h

src/Processors/Formats/Impl/ParquetV3BlockInputFormat.h

src/Processors/Formats/Impl/ParquetV3BlockInputFormat.cpp

al13n321 · 2025-08-22T05:40:18Z

Thanks for a thorough review!

nikitamikhaylov · 2025-08-23T21:58:09Z

Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, sequential): 01111_create_drop_replicated_db_stress #85774 (comment)
Stateless tests (amd_debug, parallel): 01086_odbc_roundtrip #85973
Stateless tests (arm_binary, parallel): 02435_rollback_cancelled_queries. The CI DB says it is rarely flaky.
Integration tests (amd_binary, 1/5): test_global_overcommit_tracker/test.py::test_global_overcommit #85972

al13n321 added 10 commits June 28, 2025 00:58

Yet another parquet reader

24f5417

Rewrite everything, now it's slow

9fa5650

Now it's not slow

66bc197

Things

9391476

Rebase

2c1dc3b

Refactor scheduling yet again, should be fine now

de15340

All types of filtering

496a53e

Things

e392171

Refactor decoding

a55ff1b

Fixes

610c9c7

clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jun 28, 2025

al13n321 added 9 commits June 28, 2025 02:56

Style and fasttest build

ff3bc76

Unbreak subcolumns; the new parser still doesn't fully work with tuple elements, idk how to fix it yet, see comment

0eaca6e

lint

b137c83

Unbreak buzzhouse build

747f1f2

tidy

8ee98d6

Why can't clang-tidy report all errors at once

0f7e43e

Oh, it can

1b3cc06

tidy

0264a17

More encodings

098dc76

al13n321 mentioned this pull request Jul 2, 2025

[Very WIP] Yet another parquet reader #78380

Closed

al13n321 added 5 commits July 2, 2025 04:01

Figured out the tuple situation

3dbd1d9

Improve comment a little

2bceef8

Fix arrays and tuples

29d3aed

Boolean

b4753d0

Fixes

49762ff

date_time_overflow_behavior

c855343

Conflict

4ae44cf

rienath reviewed Aug 21, 2025

al13n321 added 4 commits August 21, 2025 20:19

Merge remote-tracking branch 'origin/master' into pqv3

f556686

Review comments, ColumnString without zero byte

ca2d6fe

Link to a comment with an image

3b9105c

Oops, I thought I compiled it

e21606d

tidy

6c54783

rienath approved these changes Aug 22, 2025

al13n321 and others added 4 commits August 22, 2025 21:03

Merge remote-tracking branch 'origin/master' into pqv3

8da3d51

Fix

1b5e70f

Fix build on Mac

e55288e

Fix clang-tidy

0ac2c97

nikitamikhaylov enabled auto-merge August 23, 2025 21:58

nikitamikhaylov disabled auto-merge August 23, 2025 21:59

al13n321 enabled auto-merge August 24, 2025 01:41

al13n321 added 2 commits August 24, 2025 03:44

Merge remote-tracking branch 'origin/master' into pqv3

817bb0f

Merge remote-tracking branch 'origin/pqv3' into pqv3

ffb51f3

nikitamikhaylov removed the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 24, 2025

al13n321 added this pull request to the merge queue Aug 24, 2025

Merged via the queue into master with commit 9a64c44 Aug 24, 2025
122 checks passed

al13n321 deleted the pqv3 branch August 24, 2025 13:26

robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 24, 2025

al13n321 mentioned this pull request Aug 28, 2025

Fix ReadFromFormatInfo serialization #86397

Merged

This was referenced Aug 31, 2025

Use page index to filt out parquet pages in native parquet reader #65428

Closed

[Feature] Add support for prewhere in parquet native reader #65527

Closed

al13n321 mentioned this pull request Jan 28, 2026

multistage prewhere in parquet reader v3 #93542

Merged

1 task

svb-alt mentioned this pull request Feb 2, 2026

Project Antalya Roadmap 2025 - Real-Time Data Lakes Altinity/ClickHouse#804

Closed

31 tasks

9 participants