Skip to content

[Data] Fix file size ordering in download partitioning with multiple URI columns#58517

Merged
bveeramani merged 3 commits intomasterfrom
fix-download-file-size-ordering
Nov 16, 2025
Merged

[Data] Fix file size ordering in download partitioning with multiple URI columns#58517
bveeramani merged 3 commits intomasterfrom
fix-download-file-size-ordering

Conversation

@bveeramani
Copy link
Member

@bveeramani bveeramani commented Nov 10, 2025

The _sample_sizes method was using as_completed() to collect file sizes, which returns results in completion order rather than submission order. This scrambled the file sizes list so it no longer corresponded to the input URI order.

When multiple URI columns are used, _estimate_nrows_per_partition calls zip(*sampled_file_sizes_by_column.values()) on line 284, which assumes file sizes from different columns align by row index. The scrambled ordering caused file sizes from different rows to be incorrectly combined, producing incorrect partition size estimates.

Changes

  • Pre-allocate the file_sizes list with the correct size
  • Use a future_to_file_index mapping to track the original submission order
  • Place results at their correct positions regardless of completion order
  • Add assertion to verify list length matches expected size

Related issues

#58464 (comment)

…URI columns

The _sample_sizes method was using as_completed() to collect file sizes,
which returns results in completion order rather than submission order.
This scrambled the file sizes list so it no longer corresponded to the
input URI order.

When multiple URI columns are used, _estimate_nrows_per_partition calls
zip(*sampled_file_sizes_by_column.values()) which assumes file sizes
from different columns align by row index. The scrambled ordering caused
file sizes from different rows to be incorrectly combined, producing
wrong partition size estimates.

Fixed by pre-allocating the file_sizes list and using a future-to-index
mapping to place results at their correct positions, preserving the
original URI order regardless of completion order.

Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani requested a review from a team as a code owner November 10, 2025 23:31
Comment on lines +197 to +198
f"Failed to download URI '{uri_path}' from column '{uri_column_name}' with error: {e}"
f"Failed to download URI '{uri_path}' from column "
f"'{uri_column_name}' with error: {e}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by formatting fix

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a critical bug where file sizes were not being correctly ordered when fetched concurrently, leading to incorrect data partitioning. The fix, which involves using a future_to_index map to preserve the submission order, is a robust and standard solution for this kind of problem. The pre-allocation of the file_sizes list and the addition of an assertion are also good practices that improve code clarity and safety. I have one suggestion to further improve the robustness of the assertions.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani added the go add ONLY when ready to merge, run all tests label Nov 11, 2025
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Nov 11, 2025
Copy link
Collaborator

@robertnishihara robertnishihara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me other than the one comment I left.

Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani enabled auto-merge (squash) November 16, 2025 02:19
@github-actions github-actions bot disabled auto-merge November 16, 2025 02:19
@bveeramani bveeramani merged commit 4c70984 into master Nov 16, 2025
7 checks passed
@bveeramani bveeramani deleted the fix-download-file-size-ordering branch November 16, 2025 03:56
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…URI columns (ray-project#58517)

The `_sample_sizes` method was using `as_completed()` to collect file
sizes, which returns results in completion order rather than submission
order. This scrambled the file sizes list so it no longer corresponded
to the input URI order.

When multiple URI columns are used, `_estimate_nrows_per_partition`
calls `zip(*sampled_file_sizes_by_column.values())` on line 284, which
assumes file sizes from different columns align by row index. The
scrambled ordering caused file sizes from different rows to be
incorrectly combined, producing incorrect partition size estimates.

## Changes

- Pre-allocate the `file_sizes` list with the correct size
- Use a `future_to_file_index` mapping to track the original submission
order
- Place results at their correct positions regardless of completion
order
- Add assertion to verify list length matches expected size

## Related issues

ray-project#58464 (comment)

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Aydin Abiar <[email protected]>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…URI columns (ray-project#58517)

The `_sample_sizes` method was using `as_completed()` to collect file
sizes, which returns results in completion order rather than submission
order. This scrambled the file sizes list so it no longer corresponded
to the input URI order.

When multiple URI columns are used, `_estimate_nrows_per_partition`
calls `zip(*sampled_file_sizes_by_column.values())` on line 284, which
assumes file sizes from different columns align by row index. The
scrambled ordering caused file sizes from different rows to be
incorrectly combined, producing incorrect partition size estimates.

## Changes

- Pre-allocate the `file_sizes` list with the correct size
- Use a `future_to_file_index` mapping to track the original submission
order
- Place results at their correct positions regardless of completion
order
- Add assertion to verify list length matches expected size

## Related issues

ray-project#58464 (comment)

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: YK <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…URI columns (ray-project#58517)

The `_sample_sizes` method was using `as_completed()` to collect file
sizes, which returns results in completion order rather than submission
order. This scrambled the file sizes list so it no longer corresponded
to the input URI order.

When multiple URI columns are used, `_estimate_nrows_per_partition`
calls `zip(*sampled_file_sizes_by_column.values())` on line 284, which
assumes file sizes from different columns align by row index. The
scrambled ordering caused file sizes from different rows to be
incorrectly combined, producing incorrect partition size estimates.

## Changes

- Pre-allocate the `file_sizes` list with the correct size
- Use a `future_to_file_index` mapping to track the original submission
order
- Place results at their correct positions regardless of completion
order
- Add assertion to verify list length matches expected size

## Related issues

ray-project#58464 (comment)

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…URI columns (ray-project#58517)

The `_sample_sizes` method was using `as_completed()` to collect file
sizes, which returns results in completion order rather than submission
order. This scrambled the file sizes list so it no longer corresponded
to the input URI order.

When multiple URI columns are used, `_estimate_nrows_per_partition`
calls `zip(*sampled_file_sizes_by_column.values())` on line 284, which
assumes file sizes from different columns align by row index. The
scrambled ordering caused file sizes from different rows to be
incorrectly combined, producing incorrect partition size estimates.

## Changes

- Pre-allocate the `file_sizes` list with the correct size
- Use a `future_to_file_index` mapping to track the original submission
order
- Place results at their correct positions regardless of completion
order
- Add assertion to verify list length matches expected size

## Related issues

ray-project#58464 (comment)

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Future-Outlier <[email protected]>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…URI columns (ray-project#58517)

The `_sample_sizes` method was using `as_completed()` to collect file
sizes, which returns results in completion order rather than submission
order. This scrambled the file sizes list so it no longer corresponded
to the input URI order.

When multiple URI columns are used, `_estimate_nrows_per_partition`
calls `zip(*sampled_file_sizes_by_column.values())` on line 284, which
assumes file sizes from different columns align by row index. The
scrambled ordering caused file sizes from different rows to be
incorrectly combined, producing incorrect partition size estimates.

## Changes

- Pre-allocate the `file_sizes` list with the correct size
- Use a `future_to_file_index` mapping to track the original submission
order
- Place results at their correct positions regardless of completion
order
- Add assertion to verify list length matches expected size

## Related issues

ray-project#58464 (comment)

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: peterxcli <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants