NumericIndexedVector: compressed aggregations & point-wise operations on sparse vectors#74193
Conversation
|
This is an automated comment for commit 44daf19 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
57838ae to
8cfe289
Compare
52c4f88 to
dfbe3e9
Compare
|
Asked @rienath for a review. |
45de950 to
9ba9178
Compare
|
@FriendLey thanks for the PR! I am currently reading the paper that you linked in the issue. It might take some time to review this, but be sure – I'm on it! |
Thanks for the update! I appreciate you taking the time to review it. |
|
Dear @rienath, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
rienath
left a comment
There was a problem hiding this comment.
Thanks for the great PR! Impressive work overall, and the performance improvements highlighted in the paper are especially exciting!
I’ve reviewed the code thoroughly. Most suggestions focus on improving consistency, clarity, maintainability, and alignment with existing conventions in the codebase. The core logic looks solid, and it's clear a lot of thought went into the implementation.
A few higher-level points to consider:
-
Performance tests: Since your paper includes benchmarks, it would be fantastic to have those integrated under
tests/performanceas well. This would help us to quantify the speed ups and track the speed going forward. You can refer to this guide for details. -
Documentation: The headers could benefit from additional comments or brief design notes. The PR is quite complicated and the logic isn't always immediately obvious. That will make it easier for others to understand and build on this work.
-
CI: Looks like a few checks are failing. Could you take a look when you have a moment?
docs/en/sql-reference/functions/numeric-indexed-vector-functions.md
Outdated
Show resolved
Hide resolved
src/AggregateFunctions/AggregateFunctionGroupNumericIndexedVectorData.h
Outdated
Show resolved
Hide resolved
src/AggregateFunctions/AggregateFunctionGroupNumericIndexedVectorData.h
Outdated
Show resolved
Hide resolved
src/AggregateFunctions/AggregateFunctionGroupNumericIndexedVectorData.h
Outdated
Show resolved
Hide resolved
src/AggregateFunctions/AggregateFunctionGroupNumericIndexedVectorData.h
Outdated
Show resolved
Hide resolved
docs/en/sql-reference/functions/numeric-indexed-vector-functions.md
Outdated
Show resolved
Hide resolved
|
Hi @FriendLey @ucasfl! Just a quick check-in, this PR is still on our radar. Are you planning to continue working on it? Let us know if you need any help! |
Yes, @FriendLey is working on it. |
6509696 to
a19deda
Compare
15d6cd5 to
09f4e93
Compare
|
Thanks for all the updates so far! Could you please avoid rebasing or force-pushing this branch once review has started? When you rewrite commits mid-review, my comments go stale and I have to review everything again. I really appreciate your help in keeping the review efficient @FriendLey |
…ns.md Co-authored-by: Raufs Dunamalijevs <[email protected]>
…ns.md Co-authored-by: Raufs Dunamalijevs <[email protected]>
Co-authored-by: Raufs Dunamalijevs <[email protected]>
|
@rienath I have processed all the new comments. Please review them again when you are free. |
|
@FriendLey could you have a look at this converstation? |
@rienath I have corrected the description in the document, please review it when you are free. |
|
@FriendLey there is a failing test in CI ( |
I am working on this. I guess it is because fraction_bit_num is lost when transferring data from the replica node to the coordinator node in cloud scenario. |
|
@rienath The failed test has been fixed, please review this pr when you are free. |
|
@FriendLey thanks. Indeed, the test is green now. For my interest: how did |
@rienath The reason for the test failure is not the division operation, but the process of reading data in the ParallelReplicas scenario. In ParallelReplicas, the original data will be read by multiple replicas and then merged at the coordinator node. In the default constructor of BSINumericIndexedVector, integer_bit_num and fraction_bit_num are set to 40 and 24 respectively when ValueType is Float64. When read with ParallelReplicas, assume there are 2 replicas. If the result of one replica is empty, the data transmitted to the coordinator node is: The problem lies in the order of merge. Because the bit position has promote operation ( Different fraction_bit_num results in different results of subsequent operations. |
|
Stateless tests (amd_tsan, 3/3) — Failed: 1, Passed: 2627, Skipped: 56
Stateless tests (amd_tsan, s3 storage, 1/3) — Failed: 1, Passed: 2590, Skipped: 123
Both fail for reasons not related to this PR @FriendLey thank you for all the work! |
8b78e12
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
NumericIndexedVector: new vector data-structure backed by bit-sliced, Roaring-bitmap compression, together with more than 20 functions for building, analysing and point-wise arithmetic. Can cut storage and speed up joins, filters and aggregations on sparse data. Implements #70582 and “Large-Scale Metric Computation in Online Controlled Experiment Platform” paper by T. Xiong and Y. Wang from VLDB 2024
Documentation entry for user-facing changes