ArrowStream processing crash if non-unique dictionary #87863

Merged
Avogar merged 6 commits into ClickHouse:master from ilejn:non_uniq_arrow_dict
Oct 3, 2025

Conversation

@ilejn
Contributor

@ilejn ilejn commented Sep 29, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a crash when ClickHouse reads an ArrowStream file with a non-unique dictionary.

Documentation entry for user-facing changes

The issue was observed in a production environment (no idea how the data was created).
I've created a sample file with the following script:

import pyarrow as pa
import pyarrow.ipc as ipc

# Example data
# dict_strings = ["apple", "banana", "cherry"]  # low cardinality dictionary
dict_strings = ["apple", "banana", "banana"]  # low cardinality dictionary
indices = [2]               # indexes into dict_strings

# Create Arrow arrays
# 1) Dictionary array for strings
dictionary_array = pa.array(dict_strings)

# 2) Indices array 
indices_array = pa.array(indices, type=pa.int32())

# Create dictionary-encoded array using the indices and dictionary
dict_encoded_array = pa.DictionaryArray.from_arrays(indices_array, dictionary_array)

table = pa.Table.from_arrays([dict_encoded_array], names=["col_str"])

# Write to Arrow Stream file
with pa.OSFile("output_unique.arrowstream", "wb") as sink:
    with ipc.RecordBatchStreamWriter(sink, table.schema) as writer:
        writer.write_table(table)

The point is that 'banana' is duplicated.
ClickHouse creates a LowCardinality column with only two distinct elements ('apple' and 'banana'), so index '2' is out of bounds.
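To illustrate the mechanism, here is a minimal Python sketch (illustration only, not ClickHouse code): deduplicating the dictionary the way a LowCardinality column does leaves the original Arrow index pointing past the end of the unique set.

```python
# Illustration only (not ClickHouse code): after deduplication,
# an index that was valid for the Arrow dictionary can be out of bounds.
dict_strings = ["apple", "banana", "banana"]  # Arrow dictionary with a duplicate
index = 2                                     # valid for the Arrow dictionary

# Build the unique dictionary, preserving first-seen order.
unique = list(dict.fromkeys(dict_strings))

print(unique)               # ['apple', 'banana']
print(index < len(unique))  # False: index 2 is out of bounds after dedup
```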

Backtrace

2025.09.20 13:27:23.707456 [ 1757039 ] {} <Fatal> ClientBase: 4. raise @ 0x000000000004527e
2025.09.20 13:27:23.707477 [ 1757039 ] {} <Fatal> ClientBase: 5. __GI_abort @ 0x00000000000288ff
2025.09.20 13:27:23.707506 [ 1757039 ] {} <Fatal> ClientBase: 6. _nl_load_domain.cold @ 0x000000000002881b
2025.09.20 13:27:23.707536 [ 1757039 ] {} <Fatal> ClientBase: 7. ? @ 0x000000000003b517
2025.09.20 13:27:23.720357 [ 1757039 ] {} <Fatal> ClientBase: 8.0. inlined from /home/ilejn/projects/ClickHouse/src/Common/PODArray.h:386: DB::PODArray<char8_t, 4096ul, Allocator<false, false>, 63ul, 64ul>::operator[](long) const
2025.09.20 13:27:23.720391 [ 1757039 ] {} <Fatal> ClientBase: 8.1. inlined from /home/ilejn/projects/ClickHouse/src/Columns/ColumnNullable.h:55: DB::ColumnNullable::isNullAt(unsigned long) const
2025.09.20 13:27:23.720404 [ 1757039 ] {} <Fatal> ClientBase: 8. /home/ilejn/projects/ClickHouse/src/DataTypes/Serializations/SerializationNullable.cpp:241: DB::SerializationNullable::serializeTextEscaped(DB::IColumn const&, unsigned long, DB::WriteBuffer&, DB::FormatSettings const&) const @ 0x000000001535e889
2025.09.20 13:27:23.730228 [ 1757039 ] {} <Fatal> ClientBase: 9. /home/ilejn/projects/ClickHouse/src/DataTypes/Serializations/SerializationLowCardinality.cpp:830: void DB::SerializationLowCardinality::serializeImpl<DB::WriteBuffer&, DB::FormatSettings const&, DB::WriteBuffer&, DB::FormatSettings const&>(DB::IColumn const&, unsigned long, void (DB::ISerialization::*)(DB::IColumn const&, unsigned long, DB::WriteBuffer&, DB::FormatSettings const&) const, DB::WriteBuffer&, DB::FormatSettings const&) const @ 0x0000000015354856
2025.09.20 13:27:23.739946 [ 1757039 ] {} <Fatal> ClientBase: 10. /home/ilejn/projects/ClickHouse/src/DataTypes/Serializations/SerializationLowCardinality.cpp:730: DB::SerializationLowCardinality::serializeTextEscaped(DB::IColumn const&, unsigned long, DB::WriteBuffer&, DB::FormatSettings const&) const @ 0x0000000015353e55
2025.09.20 13:27:23.747865 [ 1757039 ] {} <Fatal> ClientBase: 11. /home/ilejn/projects/ClickHouse/src/Processors/Formats/IRowOutputFormat.cpp:78: DB::IRowOutputFormat::write(std::vector<COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::allocator<COW<DB::IColumn>::immutable_ptr<DB::IColumn>>> const&, unsigned long) @ 0x0000000019eae5c4
2025.09.20 13:27:23.755135 [ 1757039 ] {} <Fatal> ClientBase: 12. /home/ilejn/projects/ClickHouse/src/Processors/Formats/IRowOutputFormat.cpp:31: DB::IRowOutputFormat::consume(DB::Chunk) @ 0x0000000019eae30c
2025.09.20 13:27:23.763541 [ 1757039 ] {} <Fatal> ClientBase: 13. /home/ilejn/projects/ClickHouse/src/Processors/Formats/IOutputFormat.cpp:153: DB::IOutputFormat::write(DB::Block const&) @ 0x0000000019ead3e3
2025.09.20 13:27:23.821999 [ 1757039 ] {} <Fatal> ClientBase: 14. /home/ilejn/projects/ClickHouse/src/Client/ClientBase.cpp:535: DB::ClientBase::onData(DB::Block&, std::shared_ptr<DB::IAST>) @ 0x0000000019c3abb1
2025.09.20 13:27:23.881718 [ 1757039 ] {} <Fatal> ClientBase: 15. /home/ilejn/projects/ClickHouse/src/Client/ClientBase.cpp:1395: DB::ClientBase::receiveAndProcessPacket(std::shared_ptr<DB::IAST>, bool) @ 0x0000000019c4374a
2025.09.20 13:27:23.931084 [ 1757039 ] {} <Fatal> ClientBase: 16. /home/ilejn/projects/ClickHouse/src/Client/ClientBase.cpp:1361: DB::ClientBase::receiveResult(std::shared_ptr<DB::IAST>, int, bool) @ 0x0000000019c42fe4
2025.09.20 13:27:23.974651 [ 1757039 ] {} <Fatal> ClientBase: 17. /home/ilejn/projects/ClickHouse/src/Client/ClientBase.cpp:1270: DB::ClientBase::processOrdinaryQuery(String, std::shared_ptr<DB::IAST>) @ 0x0000000019c41bfc

@ilejn
Contributor Author

ilejn commented Sep 29, 2025

Hello Pavel @Avogar ,
could you have a look, please?

@Avogar Avogar self-assigned this Sep 30, 2025
@Avogar Avogar added the can be tested Allows running workflows for external contributors label Sep 30, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Sep 30, 2025

Workflow [PR], commit [61ceb5f]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Sep 30, 2025
@Avogar
Member

Avogar commented Sep 30, 2025

I don't like that we change anything in ColumnUnique to fix something in the Arrow format. We already have a check for bad indices here:

/// Check that indexes are correct (protection against corrupted files)
/// Note that on null values index can be arbitrary value.
for (int64_t i = 0; i != chunk->length(); ++i)
{
    if (!chunk->IsNull(i) && (data[i] < 0 || data[i] >= dict_size))
        throw Exception(ErrorCodes::INCORRECT_DATA,
            "Index {} in Dictionary column is out of bounds, dictionary size is {}",
            Int64(data[i]), UInt64(dict_size));
}

But this check is not really good for two reasons:

  1. dict_size is taken from the arrow dictionary size
    dict_info.dictionary_size = arrow_dict_column->length();
    but we need to use our LC dictionary size instead
  2. It doesn't take into account the possible shift of indices that may happen later in this function.

We need to fix these two problems, and then there will be no need to change ColumnUnique.

@ilejn
Contributor Author

ilejn commented Sep 30, 2025

@Avogar , thank you for the fast response.

I am not 100% sure, but we probably have the same problem in ORC (though ORC has no check like the one for Arrow you are referring to).
After all, throwing an exception if data is not unique seems natural for a method called uniqueInsertRangeFrom.

It seems that getting dict_info.dictionary_size from arrow has advantages (nulls, if I am not mistaken).

Please reconfirm that you think we should keep uniqueInsertRangeFrom intact and do the checks at the Arrow level.

@Avogar
Member

Avogar commented Sep 30, 2025

After all, throwing exception if data is not unique seems natural for a method called uniqueInsertRangeFrom.

It's not natural. uniqueInsertRangeFrom is also used to insert data into an LC column from a non-LC column, where we can have duplicates; that's why it ignores them in the first place.

Actually, maybe we can allow inserting such data: we can just create a mapping from arrow indexes to our LC indexes based on the positions returned from uniqueInsertRangeFrom. We could even use this mapping in the general case and avoid the code with index shifting (if it slows down insertion significantly, we can have separate logic for the cases with and without duplicates).
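The mapping idea can be sketched in Python (simplified, assumed semantics of uniqueInsertRangeFrom; the real method operates on ClickHouse columns):

```python
# Sketch (assumed semantics): uniqueInsertRangeFrom deduplicates the inserted
# range and returns, for each inserted value, its position in the unique
# dictionary. Those positions double as a mapping from Arrow dictionary
# indexes to LowCardinality indexes.
def unique_insert_range(dictionary):
    """Simulate uniqueInsertRangeFrom: dedup and return per-value positions."""
    unique, positions, seen = [], [], {}
    for value in dictionary:
        if value not in seen:
            seen[value] = len(unique)
            unique.append(value)
        positions.append(seen[value])
    return unique, positions

arrow_dict = ["apple", "banana", "banana"]
arrow_indices = [2]  # points at the duplicate 'banana'

unique, positions = unique_insert_range(arrow_dict)
lc_indices = [positions[i] for i in arrow_indices]
print(unique)      # ['apple', 'banana']
print(lc_indices)  # [1] -- remapped safely despite the duplicate
```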

@ilejn
Contributor Author

ilejn commented Sep 30, 2025

Of course I realize that throwing an exception cannot be the only behavior.
Anyway, I take your response to mean that I should check at the Arrow level and keep uniqueInsertRangeFrom intact.
Ok, will do.

It seems that an Arrow file with a non-unique dictionary is seriously broken: it cannot be read by pandas and the like.
That's why I don't think we should go out of our way to support it.

@Avogar
Member

Avogar commented Sep 30, 2025

Probably the best way to check whether the dictionary contains duplicates at the Arrow level is to check for duplicates in the positions column that uniqueInsertRangeFrom returns. Then there is no need to change any other code.

Or even just compare the size of the arrow dictionary with the size of the dictionary created inside LC (but it will require a careful check for the cases when we have/don't have a default value in the arrow dict and when the dictionary is Nullable).
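A minimal sketch of the size-comparison variant, in Python for illustration (the helper name is hypothetical, not a ClickHouse API, and it deliberately ignores the default-value and Nullable cases mentioned above):

```python
# Hypothetical helper (illustrative only, not ClickHouse API):
# if deduplication shrank the dictionary, the Arrow dictionary had duplicates.
def check_dictionary_unique(arrow_dict_size, lc_dict_size):
    if lc_dict_size != arrow_dict_size:
        raise ValueError(
            f"Arrow dictionary has duplicates: arrow size {arrow_dict_size}, "
            f"LowCardinality dictionary size {lc_dict_size}"
        )

check_dictionary_unique(3, 3)  # unique dictionary: passes silently
```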

@ilejn
Contributor Author

ilejn commented Sep 30, 2025

compare the size of the arrow dictionary and the size of the dictionary created inside LC (but it will require a careful check for cases when we have/don't have default value in arrow dict and when dictionary is Nullable)

That's the plan.
Actually, I started with this approach before switching to what I currently propose.

@ilejn
Contributor Author

ilejn commented Sep 30, 2025

Unfortunately there is no obvious way to create a file with default values via pyarrow, but the check seems simple and transparent.

@ilejn
Contributor Author

ilejn commented Oct 2, 2025

Hello @Avogar , can we merge it, or do you feel that we need additional tests or something?

@Avogar Avogar added this pull request to the merge queue Oct 3, 2025
Merged via the queue into ClickHouse:master with commit 20b0f8d Oct 3, 2025
123 checks passed
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Oct 3, 2025
@aadant

aadant commented Oct 3, 2025

@ilejn see this discussion on Arrow

apache/arrow#47134
apache/arrow#47151

@ilejn
Contributor Author

ilejn commented Oct 3, 2025

@ilejn see this discussion on Arrow

apache/arrow#47134 apache/arrow#47151

Thanks, interesting.

ilejn pushed a commit to Altinity/ClickHouse that referenced this pull request Oct 6, 2025
ArrowStream processing crash if non unique dictionary
zvonand added a commit to Altinity/ClickHouse that referenced this pull request Oct 10, 2025
24.8.14 Backport of ClickHouse#87863: ArrowStream processing crash if non unique dictionary
