Add groupSortedArray() function by palegre-tiny · Pull Request #34055 · ClickHouse/ClickHouse

palegre-tiny · 2022-01-27T11:31:48Z

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
New aggregation function groupSortedArray to obtain an array of first N values.

CLAassistant · 2022-01-27T11:32:28Z

All committers have signed the CLA.

palegre-tiny · 2022-01-27T11:35:12Z

Detailed:
Added a new aggregation function:

Usage 1:
groupSortedArray(n)(param1)

It returns an array with the n first values from a field param1 sorted by itself.

 select groupSortedArray(2)(number) from numbers(100)
┌─groupSortedArray(2)(number)─┐
│ [0,1]                       │
└─────────────────────────────┘

Usage 2:
groupSortedArray(n)(param1, param2)

It returns an array with the n first values from a field param1 sorted param2 (field or expression).

SELECT groupSortedArray(5)(number, number) FROM numbers(1000000)

SELECT groupSortedArray(5)(number, -number) FROM numbers(100)
┌─groupSortedArray(5)(number, negate(number))─┐
│ [99,98,97,96,95]                            │
└─────────────────────────────────────────────┘

UnamedRus · 2022-01-27T11:44:43Z

What if we want to sort by multiple columns?

SELECT groupSortedArray(5)(number, -number, number % 10) FROM numbers(100)

SELECT groupSortedArray(5)(number, (-number, number % 10)) FROM numbers(100)

I like first approach more, because sorting of tuple much slower.

palegre-tiny · 2022-01-27T11:59:48Z

What if we want to sort by multiple columns?
SELECT groupSortedArray(5)(number, -number, number % 10) FROM numbers(100)

SELECT groupSortedArray(5)(number, (-number, number % 10)) FROM numbers(100)
I like first approach more, because sorting of tuple much slower.

It is not supported. Initially it only accepts integer expressions for this second parameter.

docs/en/sql-reference/aggregate-functions/reference/groupsortedarray.md

src/AggregateFunctions/AggregateFunctionGroupSortedArray.h

palegre-tiny · 2022-02-03T10:49:08Z

What if we want to sort by multiple columns?
SELECT groupSortedArray(5)(number, -number, number % 10) FROM numbers(100)

SELECT groupSortedArray(5)(number, (-number, number % 10)) FROM numbers(100)
I like first approach more, because sorting of tuple much slower.
It is not supported. Initially it only accepts integer expressions for this second parameter.

I'm preparing a fixup supporting mode 2. Sorting by any field/tuple.

- Fix memory access - Support any type as sorting parameter - Fix tests - Rewrite/simplify function addBatchSinglePlace

alexey-milovidov · 2022-02-05T16:10:40Z

If the function named groupSortedArray, it may sound like
"group sorted items into array" - incorrect
"group items into sorted array" - correct

If the function named groupArraySorted, it will sound like
"group items into array and get it sorted" - correct

filimonov · 2022-02-08T05:30:10Z

✅ Branch has been successfully updated

palegre-tiny · 2022-02-15T14:16:46Z

Thank you for your inputs, I have been adding the requested modifications and I think it is again ready for review.

List of modifications:

Renamed to groupArraySorted
New selection function for long arrays size (More than 1000)
Added performance test for long arrays size
Fixed memory access issue
Simplified aggregation class creation methods
Fixed syntax checks and other CI issues
Added support for any type in the second parameter (sorted by)
Rewritten addBatchSinglePlace() function to remove dynamic allocations and to make the code mode understandable

I can still see a failing check but I think it is not related with my code. (test_s3_zero_copy_replication/test.py::test_s3_zero_copy_drop_detached - FAILED)

jkuklis · 2022-02-23T10:26:54Z

Hey!

I would like to ask: @palegre-tiny and PR reviewers, do you know whether it would be difficult to add support for:

groupArraySorted -Array extensions, like groupArraySortedArray and groupUniqArraySortedArray,
SimpleAggregateFunction(groupArraySorted) and with -Array extensions?

For context, we will likely need these features for a project that we discuss currently at our company. We were willing to implement these functionalities from scratch on our own, then it was good news for us that somebody else is working on a similar design, so that we have a better starting point and a source of inspiration.

We would like to use SimpleAggregateFunction(groupArraySortedArray) and SimpleAggregateFunction(groupUniqArraySortedArray) in AggregatingMergeTree. Extensions with -Array flatten the input arrays into a single output array, and SimpleAggregateFunction is a wrapper for aggregate functions like groupArrayArray.

Could you share some thoughts about these extensions above:

Can we reuse the design for groupArraySorted for -Array extensions? Are there any obstacles in the current design?
Can we simply enable groupArraySorted as a valid choice for SimpleAggregateFunction in checkSupportedFunctions in src/DataTypes/DataTypeCustomSimpleAggregateFunction.cpp? Edit: this leads to Incompatible data types between aggregate function 'groupArraySorted' which returns Array(Array(String)) and column storage type Array(String), which is natural I guess, as a function with -Array suffix is needed to flatten the result.

There are likely many details to work out here anyway, but if you have enough context to give us hints, it would be great!

Thanks!

Jakub Kuklis
Contentsquare

palegre-tiny · 2022-02-24T16:59:42Z

Thanks Jakub for your comments.

About groupArraySorted +Array extensions

If I understood correctly you suggest something like this:

table_test_array
┌─a───────────┬─b─┐
│ [1,3,5,7,9] │ 2 │
│ [0,2,4,6,8] │ 1 │
└─────────────┴───┘

groupArraySortedArray(a) from table_test_array
┌─groupArraySortedArray(a)─┐
│ [0,1,2,3,4]              │
└──────────────────────────┘

And that is clear, but what if we want to use second parameter (sort expression), we would do the following:

groupArraySortedArray(a,b) from table_test_array
┌─groupArraySortedArray(a,b)─┐
│ [0,2,4,6,8]                │
└────────────────────────────┘

this case we would not split the arrays because each has different weight.

It is actually working like that but I have doubts about if it is clear from ussage perspective.

Use with SimpleAggregateFunction

I don't think we can use it, again when using 2 parameters we need to store the weight (sort expression) attached to the values, even if at the end we only show the values.

palegre-tiny · 2022-02-25T11:45:27Z

@Mergifyio update

mergify · 2022-02-25T11:45:48Z

update

✅ Branch has been successfully updated

palegre-tiny · 2022-02-28T08:56:15Z

For sorted by second parameter usage + array extension: It is necessary to use array in both parameters:

select groupArraySortedArray([1, 2, 3], [10, 0, 5])

┌─groupArraySortedArray([1, 2, 3], [10, 0, 5])─┐
│ [2,3,1]                                      │
└──────────────────────────────────────────────┘

palegre-tiny · 2022-03-21T13:34:25Z

@Mergifyio update

mergify · 2022-03-21T13:35:07Z

update

✅ Branch has been successfully updated

palegre-tiny · 2022-03-21T18:51:40Z

Hi @filimonov @evillique,
It's been a while since the last review/comment, is there anything I can do to continue this?
Thank you

palegre-tiny · 2022-03-28T20:56:25Z

@Mergifyio update

mergify · 2022-03-28T20:57:07Z

update

✅ Branch has been successfully updated

alexey-milovidov · 2022-04-15T23:35:36Z

src/AggregateFunctions/AggregateFunctionGroupArraySorted.h

+
+template <typename TColumn, typename TFilter = void>
+size_t
+getFirstNElements_low_threshold(const TColumn * data, int num_elements, int threshold, size_t * results, const TFilter * filter = nullptr)


Wrong style.

alexey-milovidov · 2022-04-15T23:35:52Z

src/AggregateFunctions/AggregateFunctionGroupArraySorted.h

+                continue;
+        }
+
+        //Starting from the highest values and we look for the immediately lower than the given one


@evillique Wrong style.

alexey-milovidov · 2022-04-15T23:38:10Z

@evillique I did not understand the algorithm of this function in five seconds. Let's move it under "experimental" flag.

@palegre-tiny

groupArraySorted() merge operation was incorrect, it does not moves data to new arena, and hence triggers use-after-free. ASan report [1]: ==103==ERROR: AddressSanitizer: heap-use-after-free on address 0x62100c12022c at pc 0x00000d49cbcc bp 0x7f345c320e40 sp 0x7f345c3205e8 READ of size 12 at 0x62100c12022c thread T244 (QueryPipelineEx) 7 0x143e19e4 in DB::AggregateFunctionGroupArraySortedDataBase<>::merge() build_docker/../src/AggregateFunctions/AggregateFunctionGroupArraySortedData.h:69:16 8 0x2c80296e in DB::ColumnAggregateFunction::insertMergeFrom() build_docker/../src/Columns/ColumnAggregateFunction.cpp:470:11 9 0x2ec61996 in DB::TotalsHavingTransform::addToTotals() build_docker/../src/Processors/Transforms/TotalsHavingTransform.cpp:283:35 10 0x2ec5f400 in DB::TotalsHavingTransform::transform() build_docker/../src/Processors/Transforms/TotalsHavingTransform.cpp:188:9 0x62100c12022c is located 300 bytes inside of 4096-byte region [0x62100c120100,0x62100c121100) freed by thread T244 (QueryPipelineEx) here: 14 0x2c7f8ecd in DB::ColumnAggregateFunction::~ColumnAggregateFunction() build_docker/../src/Columns/ColumnAggregateFunction.cpp:85:1 26 0x2d23ffc3 in DB::Chunk::operator=(DB::Chunk&&) build_docker/../src/Processors/Chunk.h:53:17 27 0x2ec5f410 in DB::TotalsHavingTransform::transform(DB::Chunk&) build_docker/../src/Processors/Transforms/TotalsHavingTransform.cpp:189:15 previously allocated by thread T3 (TCPHandler) here: 18 0x2ed80019 in DB::AggregatingInOrderTransform::AggregatingInOrderTransform() build_docker/../src/Processors/Transforms/AggregatingInOrderTransform.cpp:20:9 SUMMARY: AddressSanitizer: heap-use-after-free crtstuff.c in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long) [1]: https://s3.amazonaws.com/clickhouse-test-reports/35111/0ce44f30210d362c3436f03e926bf7893b034f06/fuzzer_astfuzzerasan,actions//report.html Fixes: ClickHouse#34055 (cc @palegre-tiny @evillique) Signed-off-by: Azat Khuzhin <[email protected]>

azat · 2022-04-29T15:06:57Z

src/AggregateFunctions/AggregateFunctionGroupArraySortedData.h

+
+    void merge(const AggregateFunctionGroupArraySortedDataBase & other)
+    {
+        values.merge(Storage(other.values));


This merge operator may lead to use-after-free in case of StringRef usage, fix is here - #36815

This reverts commit f055d7b, reversing changes made to 4ec3c35.

Revert "Merge pull request #34055 from palegre-tiny/groupSortedArray"

Add groupSortedArray() function

e813f64

palegre-tiny closed this Jan 27, 2022

palegre-tiny reopened this Jan 27, 2022

evillique added the can be tested Allows running workflows for external contributors label Jan 27, 2022

evillique self-assigned this Jan 27, 2022

UnamedRus mentioned this pull request Jan 30, 2022

TOTALS, BY and ORDER BY combinators for aggregate functions #34156

Open

evillique reviewed Feb 2, 2022

View reviewed changes

palegre-tiny force-pushed the groupSortedArray branch 2 times, most recently from 31f3c8a to ba7a4bd Compare February 3, 2022 17:26

fixup! Add groupSortedArray() function

7f553d5

- Fix memory access - Support any type as sorting parameter - Fix tests - Rewrite/simplify function addBatchSinglePlace

palegre-tiny force-pushed the groupSortedArray branch from ba7a4bd to 7f553d5 Compare February 4, 2022 15:00

fixup! Add groupSortedArray() function

9c38b1a

filimonov reviewed Feb 8, 2022

View reviewed changes

palegre-tiny added 4 commits February 10, 2022 16:24

fixup! Add groupSortedArray() function

de7bbc7

fixup! Add groupSortedArray() function fixup! Add groupSortedArray() function

fixup! Add groupSortedArray() function

1e4b504

fixup! Evolve to TinySum

c65eaa5

fixup! Evolve to TinySum

3e537a7

Merge branch 'master' into groupSortedArray

aab54f4

fixup! Add groupSortedArray() function

9466aaf

fixup! fixup! Add groupSortedArray() function

74f99f7

Merge branch 'master' into groupSortedArray

cd6f1d8

palegre-tiny requested review from evillique and filimonov March 14, 2022 07:52

Merge branch 'master' into groupSortedArray

5df84df

Merge branch 'master' into groupSortedArray

bf5a9dc

evillique approved these changes Mar 30, 2022

View reviewed changes

evillique merged commit f055d7b into ClickHouse:master Mar 30, 2022

amosbird mentioned this pull request Apr 14, 2022

proposal: groupArraySorted ? #3708

Closed

alexey-milovidov reviewed Apr 15, 2022

View reviewed changes

azat mentioned this pull request Apr 29, 2022

Fix groupArraySorted() for optimize_aggregation_in_order and WITH TOTALS #36815

Closed

azat reviewed Apr 29, 2022

View reviewed changes

alexey-milovidov added a commit that referenced this pull request Apr 30, 2022

Revert "Merge pull request #34055 from palegre-tiny/groupSortedArray"

d6c0de0

This reverts commit f055d7b, reversing changes made to 4ec3c35.

alexey-milovidov added a commit that referenced this pull request May 6, 2022

Merge pull request #36822 from ClickHouse/revert-group-array-sorted

d76ac4d

Revert "Merge pull request #34055 from palegre-tiny/groupSortedArray"

+                  void merge(const AggregateFunctionGroupSortedArrayDataBase & other)
+                  {
+                      values.insert(other.values.begin(), other.values.end());

Conversation

palegre-tiny commented Jan 27, 2022

Uh oh!

CLAassistant commented Jan 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

palegre-tiny commented Jan 27, 2022

Uh oh!

UnamedRus commented Jan 27, 2022

Uh oh!

palegre-tiny commented Jan 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

palegre-tiny commented Feb 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexey-milovidov commented Feb 5, 2022

Uh oh!

filimonov commented Feb 8, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

filimonov Feb 8, 2022

Choose a reason for hiding this comment

Uh oh!

palegre-tiny Feb 8, 2022

Choose a reason for hiding this comment

Uh oh!

palegre-tiny Feb 11, 2022

Choose a reason for hiding this comment

Uh oh!

palegre-tiny Feb 15, 2022

Choose a reason for hiding this comment

Uh oh!

palegre-tiny commented Feb 14, 2022

Uh oh!

mergify bot commented Feb 14, 2022

✅ Branch has been successfully updated

Uh oh!

palegre-tiny commented Feb 15, 2022

Uh oh!

jkuklis commented Feb 23, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

palegre-tiny commented Feb 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

palegre-tiny commented Feb 25, 2022

Uh oh!

mergify bot commented Feb 25, 2022

✅ Branch has been successfully updated

Uh oh!

palegre-tiny commented Feb 28, 2022

Uh oh!

palegre-tiny commented Mar 21, 2022

Uh oh!

mergify bot commented Mar 21, 2022

✅ Branch has been successfully updated

Uh oh!

palegre-tiny commented Mar 21, 2022

Uh oh!

palegre-tiny commented Mar 28, 2022

Uh oh!

mergify bot commented Mar 28, 2022

✅ Branch has been successfully updated

Uh oh!

alexey-milovidov Apr 15, 2022

Choose a reason for hiding this comment

Uh oh!

alexey-milovidov Apr 15, 2022

Choose a reason for hiding this comment

Uh oh!

alexey-milovidov commented Apr 15, 2022

CLAassistant commented Jan 27, 2022 •

edited

Loading

palegre-tiny commented Jan 27, 2022 •

edited

Loading

palegre-tiny commented Feb 3, 2022 •

edited

Loading

jkuklis commented Feb 23, 2022 •

edited

Loading

palegre-tiny commented Feb 24, 2022 •

edited

Loading