Skip to content

Add groupSortedArray() function#34055

Merged
evillique merged 13 commits intoClickHouse:masterfrom
palegre-tiny:groupSortedArray
Mar 30, 2022
Merged

Add groupSortedArray() function#34055
evillique merged 13 commits intoClickHouse:masterfrom
palegre-tiny:groupSortedArray

Conversation

@palegre-tiny
Copy link
Contributor

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
New aggregation function groupSortedArray to obtain an array of first N values.

@CLAassistant
Copy link

CLAassistant commented Jan 27, 2022

CLA assistant check
All committers have signed the CLA.

@palegre-tiny
Copy link
Contributor Author

Detailed:
Added a new aggregation function:

Usage 1:
groupSortedArray(n)(param1)

It returns an array with the n first values from a field param1 sorted by itself.

 select groupSortedArray(2)(number) from numbers(100)
┌─groupSortedArray(2)(number)─┐
│ [0,1]                       │
└─────────────────────────────┘

Usage 2:
groupSortedArray(n)(param1, param2)

It returns an array with the n first values from a field param1 sorted param2 (field or expression).

SELECT groupSortedArray(5)(number, number) FROM numbers(1000000)

SELECT groupSortedArray(5)(number, -number) FROM numbers(100)
┌─groupSortedArray(5)(number, negate(number))─┐
│ [99,98,97,96,95]                            │
└─────────────────────────────────────────────┘

@palegre-tiny palegre-tiny reopened this Jan 27, 2022
@evillique evillique added the can be tested Allows running workflows for external contributors label Jan 27, 2022
@evillique evillique self-assigned this Jan 27, 2022
@UnamedRus
Copy link
Contributor

What if we want to sort by multiple columns?

SELECT groupSortedArray(5)(number, -number, number % 10) FROM numbers(100)

SELECT groupSortedArray(5)(number, (-number, number % 10)) FROM numbers(100)

I like first approach more, because sorting of tuple much slower.

@palegre-tiny
Copy link
Contributor Author

palegre-tiny commented Jan 27, 2022

What if we want to sort by multiple columns?

SELECT groupSortedArray(5)(number, -number, number % 10) FROM numbers(100)

SELECT groupSortedArray(5)(number, (-number, number % 10)) FROM numbers(100)

I like first approach more, because sorting of tuple much slower.

It is not supported. Initially it only accepts integer expressions for this second parameter.

@palegre-tiny
Copy link
Contributor Author

palegre-tiny commented Feb 3, 2022

What if we want to sort by multiple columns?

SELECT groupSortedArray(5)(number, -number, number % 10) FROM numbers(100)

SELECT groupSortedArray(5)(number, (-number, number % 10)) FROM numbers(100)

I like first approach more, because sorting of tuple much slower.

It is not supported. Initially it only accepts integer expressions for this second parameter.

I'm preparing a fixup supporting mode 2. Sorting by any field/tuple.

@palegre-tiny palegre-tiny force-pushed the groupSortedArray branch 2 times, most recently from 31f3c8a to ba7a4bd Compare February 3, 2022 17:26
- Fix memory access
- Support any type as sorting parameter
- Fix tests
- Rewrite/simplify function addBatchSinglePlace
@alexey-milovidov
Copy link
Member

If the function named groupSortedArray, it may sound like
"group sorted items into array" - incorrect
"group items into sorted array" - correct

If the function named groupArraySorted, it will sound like
"group items into array and get it sorted" - correct

@filimonov
Copy link
Contributor

See also #3708


void merge(const AggregateFunctionGroupSortedArrayDataBase & other)
{
values.insert(other.values.begin(), other.values.end());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both sides are sorted. Why not merge sort? Also with limit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will try to do like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean using map::merge / set::merge...? I'm not sure I can use them here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it is currently using merge sort inside std::map or std::set. I have modified it to use std::...::merge but I have to duplicate it first because it doesn't accept const objects. If it is not enough (maybe it is not good for small array but enough for large ones) I guess I could maintain the results list manually instead of using std::map/set.

@palegre-tiny
Copy link
Contributor Author

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Feb 14, 2022

update

✅ Branch has been successfully updated

@palegre-tiny
Copy link
Contributor Author

Thank you for your inputs, I have been adding the requested modifications and I think it is again ready for review.

List of modifications:

  • Renamed to groupArraySorted
  • New selection function for long arrays size (More than 1000)
  • Added performance test for long arrays size
  • Fixed memory access issue
  • Simplified aggregation class creation methods
  • Fixed syntax checks and other CI issues
  • Added support for any type in the second parameter (sorted by)
  • Rewritten addBatchSinglePlace() function to remove dynamic allocations and to make the code mode understandable

I can still see a failing check but I think it is not related with my code. (test_s3_zero_copy_replication/test.py::test_s3_zero_copy_drop_detached - FAILED)

@jkuklis
Copy link
Contributor

jkuklis commented Feb 23, 2022

Hey!

I would like to ask: @palegre-tiny and PR reviewers, do you know whether it would be difficult to add support for:

  • groupArraySorted -Array extensions, like groupArraySortedArray and groupUniqArraySortedArray,
  • SimpleAggregateFunction(groupArraySorted) and with -Array extensions?

For context, we will likely need these features for a project that we discuss currently at our company. We were willing to implement these functionalities from scratch on our own, then it was good news for us that somebody else is working on a similar design, so that we have a better starting point and a source of inspiration.

We would like to use SimpleAggregateFunction(groupArraySortedArray) and SimpleAggregateFunction(groupUniqArraySortedArray) in AggregatingMergeTree. Extensions with -Array flatten the input arrays into a single output array, and SimpleAggregateFunction is a wrapper for aggregate functions like groupArrayArray.

Could you share some thoughts about these extensions above:

  • Can we reuse the design for groupArraySorted for -Array extensions? Are there any obstacles in the current design?
  • Can we simply enable groupArraySorted as a valid choice for SimpleAggregateFunction in checkSupportedFunctions in src/DataTypes/DataTypeCustomSimpleAggregateFunction.cpp? Edit: this leads to Incompatible data types between aggregate function 'groupArraySorted' which returns Array(Array(String)) and column storage type Array(String), which is natural I guess, as a function with -Array suffix is needed to flatten the result.

There are likely many details to work out here anyway, but if you have enough context to give us hints, it would be great!

Thanks!

Jakub Kuklis
Contentsquare

@palegre-tiny
Copy link
Contributor Author

palegre-tiny commented Feb 24, 2022

Thanks Jakub for your comments.

  1. About groupArraySorted +Array extensions

If I understood correctly you suggest something like this:

table_test_array
┌─a───────────┬─b─┐
│ [1,3,5,7,9] │ 2 │
│ [0,2,4,6,8] │ 1 │
└─────────────┴───┘

groupArraySortedArray(a) from table_test_array
┌─groupArraySortedArray(a)─┐
│ [0,1,2,3,4]              │
└──────────────────────────┘

And that is clear, but what if we want to use second parameter (sort expression), we would do the following:

groupArraySortedArray(a,b) from table_test_array
┌─groupArraySortedArray(a,b)─┐
│ [0,2,4,6,8]                │
└────────────────────────────┘

this case we would not split the arrays because each has different weight.

It is actually working like that but I have doubts about if it is clear from ussage perspective.

  1. Use with SimpleAggregateFunction

I don't think we can use it, again when using 2 parameters we need to store the weight (sort expression) attached to the values, even if at the end we only show the values.

@palegre-tiny
Copy link
Contributor Author

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Feb 25, 2022

update

✅ Branch has been successfully updated

@palegre-tiny
Copy link
Contributor Author

For sorted by second parameter usage + array extension: It is necessary to use array in both parameters:

select groupArraySortedArray([1, 2, 3], [10, 0, 5])

┌─groupArraySortedArray([1, 2, 3], [10, 0, 5])─┐
│ [2,3,1]                                      │
└──────────────────────────────────────────────┘

@palegre-tiny
Copy link
Contributor Author

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Mar 21, 2022

update

✅ Branch has been successfully updated

@palegre-tiny
Copy link
Contributor Author

Hi @filimonov @evillique,
It's been a while since the last review/comment, is there anything I can do to continue this?
Thank you

@palegre-tiny
Copy link
Contributor Author

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Mar 28, 2022

update

✅ Branch has been successfully updated

@evillique evillique merged commit f055d7b into ClickHouse:master Mar 30, 2022

template <typename TColumn, typename TFilter = void>
size_t
getFirstNElements_low_threshold(const TColumn * data, int num_elements, int threshold, size_t * results, const TFilter * filter = nullptr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong style.

continue;
}

//Starting from the highest values and we look for the immediately lower than the given one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evillique Wrong style.

@alexey-milovidov
Copy link
Member

@evillique I did not understand the algorithm of this function in five seconds. Let's move it under "experimental" flag.

azat added a commit to azat/ClickHouse that referenced this pull request Apr 29, 2022
groupArraySorted() merge operation was incorrect, it does not moves data
to new arena, and hence triggers use-after-free.

ASan report [1]:

    ==103==ERROR: AddressSanitizer: heap-use-after-free on address 0x62100c12022c at pc 0x00000d49cbcc bp 0x7f345c320e40 sp 0x7f345c3205e8
    READ of size 12 at 0x62100c12022c thread T244 (QueryPipelineEx)
        7 0x143e19e4 in DB::AggregateFunctionGroupArraySortedDataBase<>::merge() build_docker/../src/AggregateFunctions/AggregateFunctionGroupArraySortedData.h:69:16
        8 0x2c80296e in DB::ColumnAggregateFunction::insertMergeFrom() build_docker/../src/Columns/ColumnAggregateFunction.cpp:470:11
        9 0x2ec61996 in DB::TotalsHavingTransform::addToTotals() build_docker/../src/Processors/Transforms/TotalsHavingTransform.cpp:283:35
        10 0x2ec5f400 in DB::TotalsHavingTransform::transform() build_docker/../src/Processors/Transforms/TotalsHavingTransform.cpp:188:9

    0x62100c12022c is located 300 bytes inside of 4096-byte region [0x62100c120100,0x62100c121100)
    freed by thread T244 (QueryPipelineEx) here:
        14 0x2c7f8ecd in DB::ColumnAggregateFunction::~ColumnAggregateFunction() build_docker/../src/Columns/ColumnAggregateFunction.cpp:85:1
        26 0x2d23ffc3 in DB::Chunk::operator=(DB::Chunk&&) build_docker/../src/Processors/Chunk.h:53:17
        27 0x2ec5f410 in DB::TotalsHavingTransform::transform(DB::Chunk&) build_docker/../src/Processors/Transforms/TotalsHavingTransform.cpp:189:15

    previously allocated by thread T3 (TCPHandler) here:
        18 0x2ed80019 in DB::AggregatingInOrderTransform::AggregatingInOrderTransform() build_docker/../src/Processors/Transforms/AggregatingInOrderTransform.cpp:20:9

    SUMMARY: AddressSanitizer: heap-use-after-free crtstuff.c in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long)

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/35111/0ce44f30210d362c3436f03e926bf7893b034f06/fuzzer_astfuzzerasan,actions//report.html

Fixes: ClickHouse#34055 (cc @palegre-tiny @evillique)
Signed-off-by: Azat Khuzhin <[email protected]>

void merge(const AggregateFunctionGroupArraySortedDataBase & other)
{
values.merge(Storage(other.values));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This merge operator may lead to use-after-free in case of StringRef usage, fix is here - #36815

alexey-milovidov added a commit that referenced this pull request Apr 30, 2022
alexey-milovidov added a commit that referenced this pull request May 6, 2022
Revert "Merge pull request #34055 from palegre-tiny/groupSortedArray"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants