fix: align file_id with DataFusion UInt16 dictionary #4167

Merged
roeap merged 4 commits into delta-io:main from ethan-tyler:fix/file-id-uint16
Feb 5, 2026

@ethan-tyler
Collaborator

Description

The type of the synthetic file_id column was defined independently in several places, which allowed silent drift between Int32 and UInt16. That drift causes type coercion failures in DML paths that build file_id IN (...) predicates.

Changes:
Adds file_id.rs as the single source of truth for the column name, data type, field constructor, and dictionary-aligned literal wrapper. All call sites now go through these helpers.
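
The single-source-of-truth idea can be sketched without the Arrow or DataFusion crates. Everything below is illustrative rather than the contents of file_id.rs: `DataType`, `Field`, and `TypedLiteral` are std-only stand-ins for the real Arrow/DataFusion types, and the constant and function names, the `"file_id"` column name, and the `nullable: false` choice are assumptions. The point is only that the column type is spelled out once, so a field and the literals compared against it cannot disagree.

```rust
// Hypothetical, std-only stand-ins for Arrow's `DataType` and `Field`.
// The real helpers in file_id.rs are not shown in this PR description.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    UInt16,
    Utf8,
    Dictionary(Box<DataType>, Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

/// Assumed column name, used here for illustration only.
pub const FILE_ID_COLUMN: &str = "file_id";

/// The one place that decides the column type: Dictionary<UInt16, Utf8>.
pub fn file_id_data_type() -> DataType {
    DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8))
}

/// Field constructor, so schemas never spell out the type by hand.
/// Nullability is an assumption of this sketch.
pub fn file_id_field() -> Field {
    Field {
        name: FILE_ID_COLUMN.to_string(),
        data_type: file_id_data_type(),
        nullable: false,
    }
}

/// A literal tagged with its type, as used in a `file_id IN (...)` list.
#[derive(Debug, Clone, PartialEq)]
pub struct TypedLiteral {
    pub value: String,
    pub data_type: DataType,
}

/// Literal wrapper that always uses the column's own type.
pub fn file_id_literal(path: &str) -> TypedLiteral {
    TypedLiteral {
        value: path.to_string(),
        data_type: file_id_data_type(),
    }
}

/// True when every literal already has the column's type,
/// i.e. the IN list needs no coercion.
pub fn in_list_types_match(field: &Field, literals: &[TypedLiteral]) -> bool {
    literals.iter().all(|lit| lit.data_type == field.data_type)
}
```

In this model, a literal built by hand with `Dictionary(Int32, Utf8)` fails `in_list_types_match`, which is the kind of mismatch the description attributes to independently defined types; literals built through `file_id_literal` always pass.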

Adds a guard that chunks FileGroups so the per-group partition dictionary cannot exceed the UInt16 keyspace, with two counters (count_file_groups_planned, count_file_group_chunks) to observe when chunking triggers.
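
A minimal sketch of such a guard, assuming one dictionary entry per file in a group. The function name, the `ChunkMetrics` struct, and the counter semantics are assumptions: here `count_file_groups_planned` counts input groups and `count_file_group_chunks` counts the pieces produced by splitting oversized groups. The actual limit and bookkeeping in the PR may differ.

```rust
/// Number of distinct keys a UInt16 dictionary can address (0..=65535).
pub const MAX_FILES_PER_GROUP: usize = u16::MAX as usize + 1;

/// Hypothetical counters named after the two mentioned in the PR;
/// their exact meaning here is an assumption of this sketch.
#[derive(Debug, Default, PartialEq)]
pub struct ChunkMetrics {
    pub count_file_groups_planned: usize,
    pub count_file_group_chunks: usize,
}

/// Split any group larger than `max_per_group` into consecutive chunks,
/// leaving smaller groups untouched and preserving file order.
pub fn chunk_file_groups<T: Clone>(
    groups: &[Vec<T>],
    max_per_group: usize,
    metrics: &mut ChunkMetrics,
) -> Vec<Vec<T>> {
    assert!(max_per_group > 0, "max_per_group must be positive");
    let mut out = Vec::new();
    for group in groups {
        metrics.count_file_groups_planned += 1;
        if group.len() <= max_per_group {
            // Group already fits in the keyspace: pass it through.
            out.push(group.clone());
            continue;
        }
        // Oversized group: emit fixed-size chunks and count each one.
        for chunk in group.chunks(max_per_group) {
            metrics.count_file_group_chunks += 1;
            out.push(chunk.to_vec());
        }
    }
    out
}
```

Passing the limit as a parameter keeps the sketch testable with small inputs; a caller would pass `MAX_FILES_PER_GROUP`. A nonzero `count_file_group_chunks` then signals that at least one group exceeded the keyspace.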

file_id remains an internal correlation mechanism. This change reduces its surface area and makes removal easier once ParquetAccessPlan-based scan filtering lands.

Related Issue(s)

Related:
- #4113
- #4115

Documentation

Centralize synthetic file-id type/value construction as Dictionary<UInt16, Utf8> and use it across next/legacy providers, meta-only scans, and matched-file DML planning.

Track file-group chunking required to stay within the UInt16 dictionary keyspace.

Signed-off-by: Ethan Urbanski <[email protected]>
@github-actions github-actions Bot added the binding/rust Issues for the Rust crate label Feb 5, 2026
@codecov

codecov Bot commented Feb 5, 2026

Codecov Report

❌ Patch coverage is 90.47619% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.35%. Comparing base (9261e39) to head (c39edd9).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/core/src/delta_datafusion/find_files.rs | 53.33% | 1 Missing and 6 partials ⚠️ |
| ...c/delta_datafusion/table_provider/next/scan/mod.rs | 97.91% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4167      +/-   ##
==========================================
- Coverage   76.36%   76.35%   -0.02%     
==========================================
  Files         165      166       +1     
  Lines       46274    46286      +12     
  Branches    46274    46286      +12     
==========================================
+ Hits        35337    35341       +4     
- Misses       9262     9265       +3     
- Partials     1675     1680       +5     

Collaborator

@roeap roeap left a comment

just some small comments.

Comment thread crates/core/src/delta_datafusion/table_provider/next/scan/mod.rs Outdated
Comment thread crates/core/src/delta_datafusion/file_id.rs
Comment thread crates/core/src/delta_datafusion/table_provider/next/scan/mod.rs Outdated
Comment thread crates/core/src/delta_datafusion/table_provider/next/scan/mod.rs Outdated
Comment thread crates/core/src/delta_datafusion/table_provider/next/scan/mod.rs Outdated
@roeap roeap merged commit fad773d into delta-io:main Feb 5, 2026
26 checks passed
@ethan-tyler ethan-tyler deleted the fix/file-id-uint16 branch February 5, 2026 21:07
roeap pushed a commit to roeap/delta-rs that referenced this pull request Feb 6, 2026
Related:
- delta-io#4113
- delta-io#4115