Skip to content

A few more parquet fixes#87735

Merged
al13n321 merged 4 commits intomasterfrom
pqf2
Oct 27, 2025
Merged

A few more parquet fixes#87735
al13n321 merged 4 commits intomasterfrom
pqf2

Conversation

@al13n321
Copy link
Member

@al13n321 al13n321 commented Sep 27, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

In parquet writer, emit created_by string in correct format, e.g. ClickHouse version 25.10.1 (build 5b1dfb14925db8901a4e9202cd5d63c11ecfbb9f) instead of ClickHouse v25.9.1.1-testing.
Fix parquet reader compatibility with bad files written by old parquet-mr.


This PR is stacked on top of #87600 , review only commits starting from "A few more parquet fixes".

Closes #87306

  • Workaround for files written by old duckdb with incorrect data_page_offset.
  • Make parquet writer emit created_by string in correct format (according to parquet.thrift): ClickHouse version 25.10.1 (build 5b1dfb14925db8901a4e9202cd5d63c11ecfbb9f) instead of ClickHouse v25.9.1.1-testing.
  • In the old parquet reader, unbreak a workaround for another workaround to allow reading bad files written by old parquet-mr.
  • Fixed clickhouse-test failing if diff output is not valid utf8.
  • Replaced 00900_long_parquet_load.sh with 00900_long_parquet_load_2.sh that:
    • Doesn't use an additional python script and arrow's parquet-reader tool (which is broken: Fix parquet-reader outputting invalid json arrow#72 ).
    • Picks up newly added files automatically without needing to do some manual steps which no one knows about.
    • Doesn't double the number of files in data_parquet/.
    • Is simpler.
    • But: the .reference file needs to be updated whenever a file is added to data_parquet/ or schema inference logic changes.

@clickhouse-gh
Copy link
Contributor

clickhouse-gh bot commented Oct 20, 2025

Workflow [PR], commit [4c02439]

Summary:

job_name test_name status info comment
Bugfix validation (functional tests) failure

@clickhouse-gh clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Oct 20, 2025
@al13n321 al13n321 added the no-fast-tests Drops preliminary CI jobs: Style, clang-tidy, Fast Test label Oct 20, 2025
@al13n321 al13n321 removed the no-fast-tests Drops preliminary CI jobs: Style, clang-tidy, Fast Test label Oct 20, 2025
Copy link
Member

@rienath rienath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the wait, here's the review. One minor general comment: Parquet improvements deserve to be in the changelog. Let's change the category

@al13n321 al13n321 force-pushed the pqf2 branch 2 times, most recently from 8981abc to 3eb1602 Compare October 21, 2025 08:55
Copy link
Member

@rienath rienath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's merge #87600, check the CI for this and merge

@al13n321 al13n321 added the ci-non-blocking Do not block pipeline on too many failed tests label Oct 21, 2025
@clickhouse-gh clickhouse-gh bot added pr-bugfix Pull request with bugfix, not backported by default and removed pr-not-for-changelog This PR should not be mentioned in the changelog labels Oct 21, 2025
@al13n321 al13n321 enabled auto-merge October 27, 2025 14:34
@al13n321 al13n321 added this pull request to the merge queue Oct 27, 2025
Merged via the queue into master with commit dfd8c01 Oct 27, 2025
121 of 123 checks passed
@al13n321 al13n321 deleted the pqf2 branch October 27, 2025 14:48
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Oct 27, 2025
@@ -0,0 +1,52 @@
#!/usr/bin/env bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems flaky, see cidb.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mkmkme pushed a commit to Altinity/ClickHouse that referenced this pull request Dec 16, 2025
zvonand added a commit to Altinity/ClickHouse that referenced this pull request Dec 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-non-blocking Do not block pipeline on too many failed tests pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Code: 117. DB::Exception: Dictionary page size out of bounds when reading a Parquet file with input_format_parquet_use_native_reader_v3

5 participants