Fix missing stream exception for nested JSON in wide parts #100475

Avogar merged 1 commit into ClickHouse:master
Conversation
PR ClickHouse#97523 fixed `ColumnObject::permute` and `ColumnDynamic::permute` to propagate `statistics`, but left the same bug in `filter`, `index`, `replicate`, and `scatter` for both `ColumnObject` and `ColumnDynamic`. When a top-level JSON column containing `Array(JSON)` is permuted during INSERT (e.g. MergeTree sorting with `optimize_on_insert=0`), the chain `ColumnObject::permute` -> `ColumnArray::permute` -> `indexImpl` -> `ColumnObject::index` drops statistics on the inner JSON column. This causes a mismatch: `enumerateStreams` (stream creation) uses the `block_sample` column which retains statistics via `cloneEmpty` and chooses 1 bucket (empty shared data optimization), while `serializeBinaryBulkStatePrefix` (serialization) uses the permuted column without statistics and chooses N buckets. The subsequent write to bucket 1 fails with "Stream ... not found" (LOGICAL_ERROR). Co-Authored-By: Claude Opus 4.6 <[email protected]>
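The failure conditions described above can be condensed into a hypothetical reproduction sketch. Table and column names here are invented; the settings force wide parts and disable insert-time optimization so that MergeTree sorting permutes the block, as the description explains:

```sql
-- Hypothetical repro sketch (names invented). The SETTINGS force wide parts;
-- optimize_on_insert = 0 makes MergeTree sort (permute) the inserted block.
SET allow_experimental_json_type = 1;
SET optimize_on_insert = 0;

CREATE TABLE t_nested_json
(
    id UInt64,
    data JSON
)
ENGINE = MergeTree
ORDER BY id
SETTINGS min_bytes_for_wide_part = 0, min_rows_for_wide_part = 0;

-- Inserting rows out of ORDER BY order makes the insert permute the block.
-- The inner Array(JSON) path then loses its statistics along
-- permute -> index, which before this fix failed with
-- LOGICAL_ERROR "Stream ... not found".
INSERT INTO t_nested_json VALUES
    (2, '{"arr" : [{"a" : 1}, {"a" : 2}]}'),
    (1, '{"arr" : [{"b" : 3}]}');
```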
Workflow [PR], commit [0be0ba4] Summary: ❌

AI Review Summary: This PR fixes a real correctness bug by preserving JSON path statistics through `filter`, `index`, `replicate`, and `scatter`.
> SET allow_experimental_json_type = 1;
> -- Regression test for a bug where ColumnObject::index (and filter/replicate/scatter)
The regression test reproduces the permute -> index path, but this PR also changes `filter`, `replicate`, and `scatter` in both `ColumnObject` and `ColumnDynamic`. Please add focused coverage for those transformed-column paths as well; otherwise future regressions in those methods will not be caught.
LLVM Coverage Report: PR changed-lines coverage: 100.00% (33/33, 0 noise lines excluded)
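To the reviewer's point about the other transformed-column paths, one untested direction for extra coverage is sketched below. All table and column names are invented, and whether each statement actually reaches the corresponding `ColumnObject`/`ColumnDynamic` method depends on ClickHouse internals; this is a starting point, not verified coverage:

```sql
-- Untested coverage sketch (names invented; code paths not verified).
SET allow_experimental_json_type = 1;
SET optimize_on_insert = 0;

-- filter: INSERT ... SELECT with a WHERE clause filters the source block,
-- including its JSON columns, before the write.
INSERT INTO t_dst SELECT id, data FROM t_src WHERE id % 2 = 0;

-- replicate: ARRAY JOIN over an Array(JSON) column (here a hypothetical
-- column `arr Array(JSON)`) replicates the outer columns per array element.
INSERT INTO t_flat SELECT id, j FROM t_src ARRAY JOIN arr AS j;

-- scatter: inserting into a partitioned table splits (scatters) the block
-- into one sub-block per partition before serialization.
INSERT INTO t_partitioned SELECT id, data FROM t_src;
```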
> @@ -0,0 +1,42 @@
> -- Tags: long
> SET allow_experimental_json_type = 1;
Yes, but let's keep it; I need this PR to be backported for a customer in cloud, so I don't want to wait for a CI rerun.
f94a13b
Backport #100475 to 26.1: Fix missing stream exception for nested JSON in wide parts
Backport #100475 to 25.12: Fix missing stream exception for nested JSON in wide parts
Backport #100475 to 26.2: Fix missing stream exception for nested JSON in wide parts
Backport #100475 to 26.3: Fix missing stream exception for nested JSON in wide parts
Backport #100475 to 25.8: Fix missing stream exception for nested JSON in wide parts
Backport #100475 to 25.3: Fix missing stream exception for nested JSON in wide parts
Reported by customer on version 25.12.1.1459.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix LOGICAL_ERROR exception "Stream ... not found" when inserting into a table with nested `Array(JSON)` columns in wide parts with `optimize_on_insert=0`.

Documentation entry for user-facing changes