Add groupSortedArray() function#34055
Conversation
|
Detailed: Usage 1: It returns an array with the n first values from a field param1 sorted by itself. Usage 2: It returns an array with the n first values from a field param1 sorted param2 (field or expression). SELECT groupSortedArray(5)(number, number) FROM numbers(1000000) |
|
What if we want to sort by multiple columns? I like first approach more, because sorting of tuple much slower. |
It is not supported. Initially it only accepts integer expressions for this second parameter. |
I'm preparing a fixup supporting mode 2. Sorting by any field/tuple. |
31f3c8a to
ba7a4bd
Compare
- Fix memory access - Support any type as sorting parameter - Fix tests - Rewrite/simplify function addBatchSinglePlace
ba7a4bd to
7f553d5
Compare
|
If the function named If the function named |
|
See also #3708 |
docs/en/sql-reference/aggregate-functions/reference/groupsortedarray.md
Outdated
Show resolved
Hide resolved
docs/en/sql-reference/aggregate-functions/reference/groupsortedarray.md
Outdated
Show resolved
Hide resolved
|
|
||
| void merge(const AggregateFunctionGroupSortedArrayDataBase & other) | ||
| { | ||
| values.insert(other.values.begin(), other.values.end()); |
There was a problem hiding this comment.
Both sides are sorted. Why not merge sort? Also with limit
There was a problem hiding this comment.
Thanks, I will try to do like that.
There was a problem hiding this comment.
Do you mean using map::merge / set::merge...? I'm not sure I can use them here.
There was a problem hiding this comment.
I'm not sure if it is currently using merge sort inside std::map or std::set. I have modified it to use std::...::merge but I have to duplicate it first because it doesn't accept const objects. If it is not enough (maybe it is not good for small array but enough for large ones) I guess I could maintain the results list manually instead of using std::map/set.
fixup! Add groupSortedArray() function fixup! Add groupSortedArray() function
|
@Mergifyio update |
✅ Branch has been successfully updated |
|
Thank you for your inputs, I have been adding the requested modifications and I think it is again ready for review. List of modifications:
I can still see a failing check but I think it is not related with my code. (test_s3_zero_copy_replication/test.py::test_s3_zero_copy_drop_detached - FAILED) |
|
Hey! I would like to ask: @palegre-tiny and PR reviewers, do you know whether it would be difficult to add support for:
For context, we will likely need these features for a project that we discuss currently at our company. We were willing to implement these functionalities from scratch on our own, then it was good news for us that somebody else is working on a similar design, so that we have a better starting point and a source of inspiration. We would like to use Could you share some thoughts about these extensions above:
There are likely many details to work out here anyway, but if you have enough context to give us hints, it would be great! Thanks! Jakub Kuklis |
|
Thanks Jakub for your comments.
If I understood correctly you suggest something like this: table_test_array
┌─a───────────┬─b─┐
│ [1,3,5,7,9] │ 2 │
│ [0,2,4,6,8] │ 1 │
└─────────────┴───┘
groupArraySortedArray(a) from table_test_array
┌─groupArraySortedArray(a)─┐
│ [0,1,2,3,4] │
└──────────────────────────┘And that is clear, but what if we want to use second parameter (sort expression), we would do the following: groupArraySortedArray(a,b) from table_test_array
┌─groupArraySortedArray(a,b)─┐
│ [0,2,4,6,8] │
└────────────────────────────┘this case we would not split the arrays because each has different weight. It is actually working like that but I have doubts about if it is clear from ussage perspective.
I don't think we can use it, again when using 2 parameters we need to store the weight (sort expression) attached to the values, even if at the end we only show the values. |
|
@Mergifyio update |
✅ Branch has been successfully updated |
|
For sorted by second parameter usage + array extension: It is necessary to use array in both parameters: select groupArraySortedArray([1, 2, 3], [10, 0, 5])
┌─groupArraySortedArray([1, 2, 3], [10, 0, 5])─┐
│ [2,3,1] │
└──────────────────────────────────────────────┘ |
|
@Mergifyio update |
✅ Branch has been successfully updated |
|
Hi @filimonov @evillique, |
|
@Mergifyio update |
✅ Branch has been successfully updated |
|
|
||
| template <typename TColumn, typename TFilter = void> | ||
| size_t | ||
| getFirstNElements_low_threshold(const TColumn * data, int num_elements, int threshold, size_t * results, const TFilter * filter = nullptr) |
| continue; | ||
| } | ||
|
|
||
| //Starting from the highest values and we look for the immediately lower than the given one |
|
@evillique I did not understand the algorithm of this function in five seconds. Let's move it under "experimental" flag. |
groupArraySorted() merge operation was incorrect, it does not moves data
to new arena, and hence triggers use-after-free.
ASan report [1]:
==103==ERROR: AddressSanitizer: heap-use-after-free on address 0x62100c12022c at pc 0x00000d49cbcc bp 0x7f345c320e40 sp 0x7f345c3205e8
READ of size 12 at 0x62100c12022c thread T244 (QueryPipelineEx)
7 0x143e19e4 in DB::AggregateFunctionGroupArraySortedDataBase<>::merge() build_docker/../src/AggregateFunctions/AggregateFunctionGroupArraySortedData.h:69:16
8 0x2c80296e in DB::ColumnAggregateFunction::insertMergeFrom() build_docker/../src/Columns/ColumnAggregateFunction.cpp:470:11
9 0x2ec61996 in DB::TotalsHavingTransform::addToTotals() build_docker/../src/Processors/Transforms/TotalsHavingTransform.cpp:283:35
10 0x2ec5f400 in DB::TotalsHavingTransform::transform() build_docker/../src/Processors/Transforms/TotalsHavingTransform.cpp:188:9
0x62100c12022c is located 300 bytes inside of 4096-byte region [0x62100c120100,0x62100c121100)
freed by thread T244 (QueryPipelineEx) here:
14 0x2c7f8ecd in DB::ColumnAggregateFunction::~ColumnAggregateFunction() build_docker/../src/Columns/ColumnAggregateFunction.cpp:85:1
26 0x2d23ffc3 in DB::Chunk::operator=(DB::Chunk&&) build_docker/../src/Processors/Chunk.h:53:17
27 0x2ec5f410 in DB::TotalsHavingTransform::transform(DB::Chunk&) build_docker/../src/Processors/Transforms/TotalsHavingTransform.cpp:189:15
previously allocated by thread T3 (TCPHandler) here:
18 0x2ed80019 in DB::AggregatingInOrderTransform::AggregatingInOrderTransform() build_docker/../src/Processors/Transforms/AggregatingInOrderTransform.cpp:20:9
SUMMARY: AddressSanitizer: heap-use-after-free crtstuff.c in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long)
[1]: https://s3.amazonaws.com/clickhouse-test-reports/35111/0ce44f30210d362c3436f03e926bf7893b034f06/fuzzer_astfuzzerasan,actions//report.html
Fixes: ClickHouse#34055 (cc @palegre-tiny @evillique)
Signed-off-by: Azat Khuzhin <[email protected]>
|
|
||
| void merge(const AggregateFunctionGroupArraySortedDataBase & other) | ||
| { | ||
| values.merge(Storage(other.values)); |
There was a problem hiding this comment.
This merge operator may lead to use-after-free in case of StringRef usage, fix is here - #36815
Revert "Merge pull request #34055 from palegre-tiny/groupSortedArray"
Changelog category (leave one):
New Feature
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
New aggregation function groupSortedArray to obtain an array of first N values.