ArrowStream processing crash if non unique dictionary #87863
Avogar merged 6 commits into ClickHouse:master
Hello Pavel @Avogar,
|
I don't like that we change anything in
ClickHouse/src/Processors/Formats/Impl/ArrowColumnToCHColumn.cpp, lines 649 to 657 in c0347e6
But this check is not really good for 2 reasons:
We need to fix these 2 problems and there will be no need to change |
|
@Avogar, thank you for the fast response. I am not 100% sure, but we probably have the same problem in ORC (where there is no equivalent of the Arrow check you are referring to). It seems that getting dict_info.dictionary_size from Arrow has advantages (nulls, if I am not mistaken). Please confirm whether you think we should keep uniqueInsertRangeFrom intact and make the checks at the Arrow level.
It's not natural. Actually, maybe we can allow inserting such data, we can just create a mapping from arrow indexes to our LC indexes based on the positions that are returned from |
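The remapping idea can be sketched in plain Python. This is only an illustration, not the ClickHouse implementation: `remap_dictionary` is a hypothetical helper, the dictionary is modelled as a list of strings, and the default-value slot and Nullable handling of a real LowCardinality column are ignored.

```python
def remap_dictionary(values, indices):
    """Deduplicate `values` and rewrite `indices` to point into the result.

    Hypothetical helper for illustration only; it ignores the default-value
    slot and NULL handling that a real LowCardinality column has.
    """
    unique_values = []   # deduplicated dictionary, in first-seen order
    position_of = {}     # value -> position in unique_values
    old_to_new = []      # old dictionary index -> new dictionary index

    for value in values:
        if value not in position_of:
            position_of[value] = len(unique_values)
            unique_values.append(value)
        old_to_new.append(position_of[value])

    # Every original index is translated through the mapping, so an index
    # that pointed at a duplicate entry now points at its first occurrence.
    new_indices = [old_to_new[i] for i in indices]
    return unique_values, new_indices


unique_values, new_indices = remap_dictionary(
    ["apple", "banana", "banana"], [0, 1, 2, 1]
)
print(unique_values)  # ['apple', 'banana']
print(new_indices)    # [0, 1, 1, 1]
```

With such a mapping, data with a duplicated dictionary entry could be accepted instead of rejected, at the cost of one extra pass over the indexes.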
|
Of course I realize that throwing an exception cannot be the only behavior. It seems that an Arrow file with a non-unique dictionary is seriously broken; it cannot be read by pandas and the like.
|
Probably the best way to check whether the dictionary contains duplicates at the Arrow level is to check for duplicates in the positions column. Or even just compare the size of the Arrow dictionary with the size of the dictionary created inside LC (but that will require a careful check for the cases when we have or don't have a default value in the Arrow dict and when the dictionary is Nullable).
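Both checks mentioned above can be sketched in a few lines of Python. The names `insert_unique`, `positions_have_duplicates` and `sizes_differ` are made up for this illustration, and the sketch deliberately leaves out the default-value and Nullable cases that the size comparison would need to handle.

```python
def insert_unique(values):
    """Return, for each input value, its position in a deduplicated dictionary.

    Stand-in for a unique-insert operation; hypothetical, for illustration only.
    """
    index = {}
    positions = []
    for value in values:
        # setdefault assigns the next free position only on first occurrence.
        positions.append(index.setdefault(value, len(index)))
    return positions


def positions_have_duplicates(positions):
    """Check 1: a repeated position means two dictionary entries were equal."""
    return len(set(positions)) != len(positions)


def sizes_differ(values):
    """Check 2: compare the source dictionary size with the deduplicated size."""
    return len(values) != len(set(values))


positions = insert_unique(["apple", "banana", "banana"])
print(positions)                             # [0, 1, 1]
print(positions_have_duplicates(positions))  # True
print(sizes_differ(["apple", "banana", "banana"]))  # True
```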
That's the plan. |
|
Unfortunately, there is no obvious way to create a file with default values via pyarrow, but the check seems to be simple and transparent.
|
Hello @Avogar, can we merge it, or do you feel that we need additional tests or something?
|
@ilejn see this discussion on Arrow |
Thanks, interesting. |
24.8.14 Backport of ClickHouse#87863: ArrowStream processing crash if non unique dictionary
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
ClickHouse crashes if ArrowStream file has non-unique dictionary
Documentation entry for user-facing changes
The issue was observed in a production environment (no idea how the data was created).
I've created a sample file with the script
The point is that 'banana' is duplicated.
ClickHouse creates a LowCardinality column with two meaningful elements ('apple' and 'banana'), so index '2' is out of bounds.
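The failure mode can be reproduced in miniature without Arrow or ClickHouse. In this sketch the dictionary and the index values are assumptions chosen to match the description (a duplicated 'banana' and an index of 2); the actual sample file may differ, and the default-value slot of a real LowCardinality dictionary is not modelled.

```python
# Assumed input: a dictionary with a duplicated entry and indexes that
# reference every dictionary slot, including the duplicate.
arrow_dictionary = ["apple", "banana", "banana"]
arrow_indices = [0, 1, 2]

# Deduplicating the dictionary (as a unique insert would) keeps two values.
deduplicated = list(dict.fromkeys(arrow_dictionary))
print(deduplicated)  # ['apple', 'banana']

# Reusing the original indexes unchanged now goes past the end.
out_of_range = [i for i in arrow_indices if i >= len(deduplicated)]
print(out_of_range)  # [2]
```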
Backtrace