Skip to content

Test new join#80

Merged
justinGilmer merged 12 commits intostagingfrom
test_new_join
Feb 23, 2024
Merged

Test new join#80
justinGilmer merged 12 commits intostagingfrom
test_new_join

Conversation

@justinGilmer
Copy link
Copy Markdown

When the amount of streams in a streamset grows into the 100s+, the join logic for windows and aligned windows queries became a computational burden due to the join logic in pyarrow for tables only operating on a table at a time.

Since we can just join on the 'time' column, a simpler approach is to iterate through all windowed data, get a unique sorted list of all timestamps, preallocate a null arrow table for all data with the time column being all the unique sorted timestamps above, and then take all the values that are returned from the windows queries and replace the null entries with their available data. This is needed because aligned windows queries will return an empty table for timeranges where there are no data present, while windows queries will return an entry for every timestamp.

This approach scales well as the number of streams increases in terms of run time, for 1000 streams its approximately 1.75-2x faster than the previous approach.

@justinGilmer justinGilmer requested a review from jleifnf February 23, 2024 17:28
@justinGilmer justinGilmer merged commit 405fb6f into staging Feb 23, 2024
@justinGilmer justinGilmer deleted the test_new_join branch February 23, 2024 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants