ArrowStream processing crash if non-unique dictionary #87863

Merged
Avogar merged 6 commits into ClickHouse:master from ilejn:non_uniq_arrow_dict
Oct 3, 2025

Conversation

@ilejn
Contributor

@ilejn ilejn commented Sep 29, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a crash when ClickHouse reads an ArrowStream file with a non-unique dictionary.

Documentation entry for user-facing changes

The issue was observed in a production environment (no idea how the data was created).
I've created a sample file with the following script:

import pyarrow as pa
import pyarrow.ipc as ipc

# Example data
# dict_strings = ["apple", "banana", "cherry"]  # low cardinality dictionary
dict_strings = ["apple", "banana", "banana"]  # low cardinality dictionary
indices = [2]               # indexes into dict_strings

# Create Arrow arrays
# 1) Dictionary array for strings
dictionary_array = pa.array(dict_strings)

# 2) Indices array 
indices_array = pa.array(indices, type=pa.int32())

# Create dictionary-encoded array using the indices and dictionary
dict_encoded_array = pa.DictionaryArray.from_arrays(indices_array, dictionary_array)

table = pa.Table.from_arrays([dict_encoded_array], names=["col_str"])

# Write to Arrow Stream file
with pa.OSFile("output_unique.arrowstream", "wb") as sink:
    with ipc.RecordBatchStreamWriter(sink, table.schema) as writer:
        writer.write_table(table)

The point is that 'banana' is duplicated.
ClickHouse creates a LowCardinality column with only two distinct elements ('apple' and 'banana'), so index '2' is out of bounds.
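To illustrate the mechanism, here is a minimal Python sketch (illustration only, not ClickHouse code): deduplicating the dictionary the way a LowCardinality column does leaves the original Arrow index pointing past the end of the unique set.

```python
# Illustration only (not ClickHouse code): after deduplication,
# an index that was valid for the Arrow dictionary can be out of bounds.
dict_strings = ["apple", "banana", "banana"]  # Arrow dictionary with a duplicate
index = 2                                     # valid for the Arrow dictionary

# Build the unique dictionary, preserving first-seen order.
unique = list(dict.fromkeys(dict_strings))

print(unique)               # ['apple', 'banana']
print(index < len(unique))  # False: index 2 is out of bounds after dedup
```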

Backtrace

2025.09.20 13:27:23.707456 [ 1757039 ] {} <Fatal> ClientBase: 4. raise @ 0x000000000004527e
2025.09.20 13:27:23.707477 [ 1757039 ] {} <Fatal> ClientBase: 5. __GI_abort @ 0x00000000000288ff
2025.09.20 13:27:23.707506 [ 1757039 ] {} <Fatal> ClientBase: 6. _nl_load_domain.cold @ 0x000000000002881b
2025.09.20 13:27:23.707536 [ 1757039 ] {} <Fatal> ClientBase: 7. ? @ 0x000000000003b517
2025.09.20 13:27:23.720357 [ 1757039 ] {} <Fatal> ClientBase: 8.0. inlined from /home/ilejn/projects/ClickHouse/src/Common/PODArray.h:386: DB::PODArray<char8_t, 4096ul, Allocator<false, false>, 63ul, 64ul>::operator[](long) const
2025.09.20 13:27:23.720391 [ 1757039 ] {} <Fatal> ClientBase: 8.1. inlined from /home/ilejn/projects/ClickHouse/src/Columns/ColumnNullable.h:55: DB::ColumnNullable::isNullAt(unsigned long) const
2025.09.20 13:27:23.720404 [ 1757039 ] {} <Fatal> ClientBase: 8. /home/ilejn/projects/ClickHouse/src/DataTypes/Serializations/SerializationNullable.cpp:241: DB::SerializationNullable::serializeTextEscaped(DB::IColumn const&, unsigned long, DB::WriteBuffer&, DB::FormatSettings const&) const @ 0x000000001535e889
2025.09.20 13:27:23.730228 [ 1757039 ] {} <Fatal> ClientBase: 9. /home/ilejn/projects/ClickHouse/src/DataTypes/Serializations/SerializationLowCardinality.cpp:830: void DB::SerializationLowCardinality::serializeImpl<DB::WriteBuffer&, DB::FormatSettings const&, DB::WriteBuffer&, DB::FormatSettings const&>(DB::IColumn const&, unsigned long, void (DB::ISerialization::*)(DB::IColumn const&, unsigned long, DB::WriteBuffer&, DB::FormatSettings const&) const, DB::WriteBuffer&, DB::FormatSettings const&) const @ 0x0000000015354856
2025.09.20 13:27:23.739946 [ 1757039 ] {} <Fatal> ClientBase: 10. /home/ilejn/projects/ClickHouse/src/DataTypes/Serializations/SerializationLowCardinality.cpp:730: DB::SerializationLowCardinality::serializeTextEscaped(DB::IColumn const&, unsigned long, DB::WriteBuffer&, DB::FormatSettings const&) const @ 0x0000000015353e55
2025.09.20 13:27:23.747865 [ 1757039 ] {} <Fatal> ClientBase: 11. /home/ilejn/projects/ClickHouse/src/Processors/Formats/IRowOutputFormat.cpp:78: DB::IRowOutputFormat::write(std::vector<COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::allocator<COW<DB::IColumn>::immutable_ptr<DB::IColumn>>> const&, unsigned long) @ 0x0000000019eae5c4
2025.09.20 13:27:23.755135 [ 1757039 ] {} <Fatal> ClientBase: 12. /home/ilejn/projects/ClickHouse/src/Processors/Formats/IRowOutputFormat.cpp:31: DB::IRowOutputFormat::consume(DB::Chunk) @ 0x0000000019eae30c
2025.09.20 13:27:23.763541 [ 1757039 ] {} <Fatal> ClientBase: 13. /home/ilejn/projects/ClickHouse/src/Processors/Formats/IOutputFormat.cpp:153: DB::IOutputFormat::write(DB::Block const&) @ 0x0000000019ead3e3
2025.09.20 13:27:23.821999 [ 1757039 ] {} <Fatal> ClientBase: 14. /home/ilejn/projects/ClickHouse/src/Client/ClientBase.cpp:535: DB::ClientBase::onData(DB::Block&, std::shared_ptr<DB::IAST>) @ 0x0000000019c3abb1
2025.09.20 13:27:23.881718 [ 1757039 ] {} <Fatal> ClientBase: 15. /home/ilejn/projects/ClickHouse/src/Client/ClientBase.cpp:1395: DB::ClientBase::receiveAndProcessPacket(std::shared_ptr<DB::IAST>, bool) @ 0x0000000019c4374a
2025.09.20 13:27:23.931084 [ 1757039 ] {} <Fatal> ClientBase: 16. /home/ilejn/projects/ClickHouse/src/Client/ClientBase.cpp:1361: DB::ClientBase::receiveResult(std::shared_ptr<DB::IAST>, int, bool) @ 0x0000000019c42fe4
2025.09.20 13:27:23.974651 [ 1757039 ] {} <Fatal> ClientBase: 17. /home/ilejn/projects/ClickHouse/src/Client/ClientBase.cpp:1270: DB::ClientBase::processOrdinaryQuery(String, std::shared_ptr<DB::IAST>) @ 0x0000000019c41bfc

@ilejn
Contributor Author

ilejn commented Sep 29, 2025

Hello Pavel @Avogar ,
could you have a look, please?

@Avogar Avogar self-assigned this Sep 30, 2025
@Avogar Avogar added the can be tested Allows running workflows for external contributors label Sep 30, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Sep 30, 2025

Workflow [PR], commit [61ceb5f]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Sep 30, 2025
@Avogar
Member

Avogar commented Sep 30, 2025

I don't like that we change anything in ColumnUnique to fix something in the Arrow format. We already have a check for bad indices here:

/// Check that indexes are correct (protection against corrupted files)
/// Note that on null values index can be arbitrary value.
for (int64_t i = 0; i != chunk->length(); ++i)
{
    if (!chunk->IsNull(i) && (data[i] < 0 || data[i] >= dict_size))
        throw Exception(ErrorCodes::INCORRECT_DATA,
            "Index {} in Dictionary column is out of bounds, dictionary size is {}",
            Int64(data[i]), UInt64(dict_size));
}

But this check is not really good for two reasons:

  1. dict_size is taken from the arrow dictionary size
    dict_info.dictionary_size = arrow_dict_column->length();
    but we need to use our LC dictionary size instead
  2. It doesn't take into account the possible shift of indices that may happen later in this function.

We need to fix these two problems, and then there will be no need to change ColumnUnique.

@ilejn
Contributor Author

ilejn commented Sep 30, 2025

@Avogar , thank you for the fast response.

I am not 100% sure, but we probably have the same problem in ORC (though ORC has no check like the one for Arrow you are referring to).
After all, throwing an exception if data is not unique seems natural for a method called uniqueInsertRangeFrom.

It seems that getting dict_info.dictionary_size from arrow has advantages (nulls, if I am not mistaken).

Please reconfirm that you think we should keep uniqueInsertRangeFrom intact and do the checks at the Arrow level.

@Avogar
Member

Avogar commented Sep 30, 2025

After all, throwing exception if data is not unique seems natural for a method called uniqueInsertRangeFrom.

It's not natural. uniqueInsertRangeFrom is also used to insert data into an LC column from a non-LC column, where we can have duplicates; that's why it ignores them in the first place.

Actually, maybe we can allow inserting such data: we can just create a mapping from arrow indexes to our LC indexes based on the positions returned from uniqueInsertRangeFrom. We could even use this mapping in the general case and avoid the code with index shifting (if it slows down insertion significantly, we can have separate logic for the cases with and without duplicates).
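The mapping idea can be sketched in Python (simplified, assumed semantics of uniqueInsertRangeFrom; the real method operates on ClickHouse columns):

```python
# Sketch (assumed semantics): uniqueInsertRangeFrom deduplicates the inserted
# range and returns, for each inserted value, its position in the unique
# dictionary. Those positions double as a mapping from Arrow dictionary
# indexes to LowCardinality indexes.
def unique_insert_range(dictionary):
    """Simulate uniqueInsertRangeFrom: dedup and return per-value positions."""
    unique, positions, seen = [], [], {}
    for value in dictionary:
        if value not in seen:
            seen[value] = len(unique)
            unique.append(value)
        positions.append(seen[value])
    return unique, positions

arrow_dict = ["apple", "banana", "banana"]
arrow_indices = [2]  # points at the duplicate 'banana'

unique, positions = unique_insert_range(arrow_dict)
lc_indices = [positions[i] for i in arrow_indices]
print(unique)      # ['apple', 'banana']
print(lc_indices)  # [1] -- remapped safely despite the duplicate
```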

@ilejn
Contributor Author

ilejn commented Sep 30, 2025

Of course I realize that throwing an exception cannot be the only behavior.
Anyway, I take your response to mean that I should check at the Arrow level and keep uniqueInsertRangeFrom intact.
Ok, will do.

It seems that an Arrow file with a non-unique dictionary is seriously broken: it cannot be read by pandas and the like.
That's why I don't think we should go out of our way to support it.

@Avogar
Member

Avogar commented Sep 30, 2025

Probably the best way to check whether the dictionary contains duplicates at the Arrow level is to check for duplicates in the positions column that uniqueInsertRangeFrom returns. Then there is no need to change any other code.

Or even just compare the size of the arrow dictionary with the size of the dictionary created inside LC (but it will require a careful check for the cases when we have/don't have a default value in the arrow dict and when the dictionary is Nullable).
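A minimal sketch of the size-comparison variant, in Python for illustration (the helper name is hypothetical, not a ClickHouse API, and it deliberately ignores the default-value and Nullable cases mentioned above):

```python
# Hypothetical helper (illustrative only, not ClickHouse API):
# if deduplication shrank the dictionary, the Arrow dictionary had duplicates.
def check_dictionary_unique(arrow_dict_size, lc_dict_size):
    if lc_dict_size != arrow_dict_size:
        raise ValueError(
            f"Arrow dictionary has duplicates: arrow size {arrow_dict_size}, "
            f"LowCardinality dictionary size {lc_dict_size}"
        )

check_dictionary_unique(3, 3)  # unique dictionary: passes silently
```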

@ilejn
Contributor Author

ilejn commented Sep 30, 2025

compare the size of the arrow dictionary and the size of the dictionary created inside LC (but it will require a careful check for cases when we have/don't have default value in arrow dict and when dictionary is Nullable)

That's the plan.
Actually, I started with this approach before switching to what I currently propose.

@ilejn
Contributor Author

ilejn commented Sep 30, 2025

Unfortunately there is no obvious way to create a file with default values via pyarrow, but the check seems simple and transparent.

@ilejn
Contributor Author

ilejn commented Oct 2, 2025

Hello @Avogar , can we merge it, or do you feel that we need additional tests or something?

@Avogar Avogar added this pull request to the merge queue Oct 3, 2025
Merged via the queue into ClickHouse:master with commit 20b0f8d Oct 3, 2025
123 checks passed
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Oct 3, 2025
@aadant

aadant commented Oct 3, 2025

@ilejn see this discussion on Arrow

apache/arrow#47134
apache/arrow#47151

@ilejn
Contributor Author

ilejn commented Oct 3, 2025

@ilejn see this discussion on Arrow

apache/arrow#47134 apache/arrow#47151

Thanks, interesting.

ilejn pushed a commit to Altinity/ClickHouse that referenced this pull request Oct 6, 2025
ArrowStream processing crash if non unique dictionary
zvonand added a commit to Altinity/ClickHouse that referenced this pull request Oct 10, 2025
24.8.14 Backport of ClickHouse#87863: ArrowStream processing crash if non unique dictionary
