[Data] Fix file size ordering in download partitioning with multiple URI columns #58517
Merged
bveeramani merged 3 commits into master on Nov 16, 2025
…URI columns

The _sample_sizes method was using as_completed() to collect file sizes, which returns results in completion order rather than submission order. This scrambled the file sizes list so it no longer corresponded to the input URI order.

When multiple URI columns are used, _estimate_nrows_per_partition calls zip(*sampled_file_sizes_by_column.values()), which assumes file sizes from different columns align by row index. The scrambled ordering caused file sizes from different rows to be incorrectly combined, producing wrong partition size estimates.

Fixed by pre-allocating the file_sizes list and using a future-to-index mapping to place results at their correct positions, preserving the original URI order regardless of completion order.

Signed-off-by: Balaji Veeramani <[email protected]>
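The ordering fix described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Ray Data code: `fake_get_size`, `sample_sizes_naive`, `sample_sizes_ordered`, and the `(delay, size)` inputs are hypothetical stand-ins for the real file-size lookup.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_get_size(delay_and_size):
    # Hypothetical stand-in for a remote file-size lookup.
    # The delay simulates lookups that finish at different times.
    delay, size = delay_and_size
    time.sleep(delay)
    return size


def sample_sizes_naive(items):
    # Appends results as they complete, so the output follows
    # completion order rather than submission order.
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(fake_get_size, item) for item in items]
        return [future.result() for future in as_completed(futures)]


def sample_sizes_ordered(items):
    # Pre-allocate one slot per input, then write each result into
    # the slot that matches its submission index.
    file_sizes = [None] * len(items)
    with ThreadPoolExecutor(max_workers=4) as executor:
        future_to_index = {
            executor.submit(fake_get_size, item): index
            for index, item in enumerate(items)
        }
        for future in as_completed(future_to_index):
            file_sizes[future_to_index[future]] = future.result()
    assert len(file_sizes) == len(items)
    return file_sizes


items = [(0.2, 100), (0.0, 200), (0.1, 300)]
naive_sizes = sample_sizes_naive(items)      # order depends on timing
ordered_sizes = sample_sizes_ordered(items)  # always matches input order
```

The naive version returns the same values, but their order depends on which lookup finishes first. The indexed version still processes results as they complete while keeping the output in input order.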
bveeramani
commented
Nov 10, 2025
Comment on lines +197 to +198
- f"Failed to download URI '{uri_path}' from column '{uri_column_name}' with error: {e}"
+ f"Failed to download URI '{uri_path}' from column "
+ f"'{uri_column_name}' with error: {e}"
Member
Author
Drive-by formatting fix
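As a quick check that this kind of line split does not change the message, adjacent f-string literals are concatenated by Python at compile time. The values below are hypothetical placeholders, not taken from the PR.

```python
# Hypothetical placeholder values, for illustration only.
uri_path = "s3://bucket/file.bin"
uri_column_name = "uri"
e = ValueError("boom")

# The original single-line message.
single_line = f"Failed to download URI '{uri_path}' from column '{uri_column_name}' with error: {e}"

# The same message split across two adjacent f-string literals.
split_lines = (
    f"Failed to download URI '{uri_path}' from column "
    f"'{uri_column_name}' with error: {e}"
)

messages_match = single_line == split_lines
```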
Contributor
Code Review
This pull request correctly addresses a critical bug where file sizes were not being correctly ordered when fetched concurrently, leading to incorrect data partitioning. The fix, which involves using a future_to_index map to preserve the submission order, is a robust and standard solution for this kind of problem. The pre-allocation of the file_sizes list and the addition of an assertion are also good practices that improve code clarity and safety. I have one suggestion to further improve the robustness of the assertions.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Balaji Veeramani <[email protected]>
Collaborator
robertnishihara
left a comment
Looks good to me other than the one comment I left.
iamjustinhsu
approved these changes
Nov 12, 2025
Signed-off-by: Balaji Veeramani <[email protected]>
Aydin-ab
pushed a commit
to Aydin-ab/ray-aydin
that referenced
this pull request
Nov 19, 2025
…URI columns (ray-project#58517)

The `_sample_sizes` method was using `as_completed()` to collect file sizes, which returns results in completion order rather than submission order. This scrambled the file sizes list so it no longer corresponded to the input URI order.

When multiple URI columns are used, `_estimate_nrows_per_partition` calls `zip(*sampled_file_sizes_by_column.values())` on line 284, which assumes file sizes from different columns align by row index. The scrambled ordering caused file sizes from different rows to be incorrectly combined, producing incorrect partition size estimates.

## Changes

- Pre-allocate the `file_sizes` list with the correct size
- Use a `future_to_file_index` mapping to track the original submission order
- Place results at their correct positions regardless of completion order
- Add assertion to verify list length matches expected size

## Related issues

ray-project#58464 (comment)

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Aydin Abiar <[email protected]>
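To illustrate why row alignment matters for the `zip(*sampled_file_sizes_by_column.values())` call, here is a small sketch. The column names, byte counts, and the `row_sizes` helper are assumptions made up for the example. How the real `_estimate_nrows_per_partition` aggregates these values is not shown in this PR description.

```python
def row_sizes(sizes_by_column):
    # zip(*...) pairs the i-th entry of every column, so the result is
    # only meaningful when each list is ordered by row index.
    return [sum(sizes) for sizes in zip(*sizes_by_column.values())]


# Hypothetical sampled sizes in bytes: two URI columns, three rows.
sizes_in_row_order = {
    "image_uri": [100, 200, 300],
    "audio_uri": [10, 20, 30],
}
aligned = row_sizes(sizes_in_row_order)  # [110, 220, 330]

# Same values, but one column arrived in completion order instead.
sizes_scrambled = dict(sizes_in_row_order, audio_uri=[30, 10, 20])
misaligned = row_sizes(sizes_scrambled)  # [130, 210, 320]
```

With the scrambled column, each per-row total mixes sizes from different rows, which is the mismatch the description refers to.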
ykdojo
pushed a commit
to ykdojo/ray
that referenced
this pull request
Nov 27, 2025
SheldonTsen
pushed a commit
to SheldonTsen/ray
that referenced
this pull request
Dec 1, 2025
Future-Outlier
pushed a commit
to Future-Outlier/ray
that referenced
this pull request
Dec 7, 2025
peterxcli
pushed a commit
to peterxcli/ray
that referenced
this pull request
Feb 25, 2026