fix: Prefer numeric in type coercion for comparisons by neilconway · Pull Request #20426 · apache/datafusion

neilconway · 2026-02-19T02:09:16Z

Which issue does this PR close?

Closes Incorrect cast of integer columns to utf8 when comparing with utf8 constant #15161.

Rationale for this change

In a comparison between a numeric column and a string literal (e.g., WHERE int_col < '10'), we previously coerced the numeric column to be a string type. This resulted in doing a lexicographic comparison, which results in incorrect query results.

Instead, we split type coercion into two situations: type coercion for comparisons (including IN lists, BETWEEN, and CASE WHEN), where we want string->numeric coercion, and type coercion for places like UNION or CASE ... THEN/ELSE, where DataFusion's traditional behavior has been to tolerate type mismatching by coercing values to strings.

Here is a (not necessarily exhaustive) summary of the behavioral changes (old -> new):

  Comparisons (=, <, >, etc.):
    float_col = '5'      : string (wrong: '5'!='5.0')  -> numeric
    int_col > '100'      : string (wrong: '325'<'100') -> numeric
    int_col = 'hello'    : string, always false        -> cast error
    int_col = ''         : string, always false        -> cast error
    int_col = '99.99'    : string, always false        -> cast error
    Dict(Int) = '5'      : string                      -> numeric
    REE(Int) = '5'       : string                      -> numeric
    struct(int)=struct(str): int field to Utf8  -> str field to int

  IN lists:
    float_col IN ('1.0') : string (wrong: '1.0'!='1') -> numeric
    str_col IN ('a', 1)  : coerce to Utf8       -> coerce to Int64

  CASE:
    CASE str WHEN float  : coerce to Utf8       -> coerce to Float

  LIKE / regex:
    Dict(Int) LIKE '%5%' : coerce to Utf8       -> error (matches int)
    REE(Int) LIKE '%5%'  : coerce to Utf8       -> error (matches int)
    Dict(Int) ~ '5'      : coerce to Utf8       -> error (matches int)
    REE(Int) ~ '5'       : error (no REE)       -> error (REE added)
    REE(Utf8) ~ '5'      : error (no REE)       -> works (REE added)

What changes are included in this PR?

Update comparison_coercion to coerce strings to numerics
Remove previous comparison_coercion_numeric function
Add a new function, type_union_coercion, and use it when appropriate
Add support for REE types with regexp operators (this was unsupported for no good reason I can see)
Add unit and SLT tests for new coercion behavior
Update existing SLT tests for changes in coercion behavior
Fix the ClickBench unparser tests to avoid comparing int fields with non-numeric string literals

Are these changes tested?

Yes. New tests added, existing tests pass.

Are there any user-facing changes?

Yes, see table above. In most cases the new behavior should be more sensible and less error-prone, but it will likely break some user code.

neilconway · 2026-02-19T02:13:12Z

TODOs:

ClickBench queries are broken (see below)
We might be able to remove some/all of the special-case logic in cast_literal_to_type_with_op (see perf: unwrap cast for comparing ints =/!= strings #15110)
Decide whether type union coercion is well-motivated, or if we want to remove it (keep)

neilconway · 2026-02-19T02:30:41Z

One example of a weird behavior that we get with the new coercion rules proposed here:

> create table t1 (a text, b int, c float);
0 row(s) fetched.
Elapsed 0.018 seconds.

> explain select * from t1 where a < 5;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │    CAST(a AS Int64) < 5   │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │          bytes: 0         │ |
|               | │       format: memory      │ |
|               | │          rows: 0          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.016 seconds.

This follows from the "coerce string to numeric" change in general, but it's a bit weird. I think we should probably reject queries like this entirely, on the basis of a type mismatch.

If we did that and nothing else, we'd break a ton of queries (e.g., anything comparing a numeric column with a string literal). But we could fix that up by treating comparisons with string literals specially (e.g., "if x OP y mismatch in types, check if either x or y is a string literal; if so, try to coerce it to a numeric type"). So that's an alternative approach worth considering.

neilconway · 2026-02-19T02:35:56Z

The ClickBench queries are just weird: they have expressions like "EventDate" >= '2013-07-01', where EventDate is a uint16. Previously, EventDate would be cast to a string and the comparison would have been done lexicographically, which is wrong but didn't fail. Now we'll try to coerce the string literal to a numeric, which fails. So this is intended behavior -- the ClickBench queries need fixing up some other way.

AlonSpivack · 2026-02-19T08:12:16Z

Thanks for putting this PR together @neilconway!
Splitting out type_union_coercion is definitely the right architectural move.

Regarding your dilemma about text_col < 5:
I strongly agree with your intuition that we should reject these queries entirely with a clear type-mismatch/planning error, rather than trying to add special-case fallback logic. If a user actually wants to compare a string column to a number, they should be required to add an explicit CAST.

neilconway · 2026-02-19T12:15:22Z

@AlonSpivack The challenge with rejecting text_col < 5 under this approach is that in comparison_coercion, we've already lost the fact that we're comparing a column with a literal. All we see is a comparison between a string-typed value and a numeric-typed value, which this approach would allow.

If we want to reject text_col < 5 and similar expressions, I think we need to remove the comparison_coercion logic for coercing string -> numeric and the like. We'd then have two options:

Introduce an "unknown" type. All string literals would have unknown type, and then would be coerced into a concrete type depending on the context in which they are used. This is roughly what Postgres does.
Don't introduce a distinct type, but just add some (perhaps ad-hoc) rules for coercing binary expressions where exactly one operand is a string literal and the other operand is a numeric-typed column reference.

(1) is probably more principled, but I think it's too big of a change to undertake on short notice.

(2) sounds plausible to me at first blush, but it needs more consideration.

AlonSpivack · 2026-02-19T15:30:13Z

Hey @neilconway ,
thanks for moving so quickly on this. Splitting the coercion rules between comparison_coercion and type_union_coercion is definitely the right architectural foundation, and it aligns with what @alamb suggested previously about context-aware coercion.

Regarding the bad query plans you mentioned (like CAST(text_col AS Int64) < 5) and the ClickBench issues:

I think we should tackle this in two separate PRs to respect the project's architecture:

PR 1 (This PR - Type Level): Your current approach (coercing string→numeric for comparisons) is correct at the type-abstraction level. Generating a CAST here is the right logical step to ensure mathematical correctness. As @alamb noted in the past (e.g., on PR #15482), type coercion rules should be based strictly on (DataType, DataType) pairs, not on checking for Expr::Literal. So mixing literal evaluation into this specific phase would be architecturally incorrect.

PR 2 (Follow-up - Optimizer Level): Once this PR lands, I'd like to submit a follow-up PR targeting the simplifier (specifically extending cast_literal_to_type_with_op in unwrap_cast.rs). Since the simplifier is expected to be expression-aware, we can safely intercept the CAST there:

Early Literal Folding: Fold Literal(Utf8("10")) → Literal(Int64(10)) at planning time when compared against a numeric column. This gives zero runtime cost and restores performance/predicate pushdown.

Fail-Fast for Invalid Literals: Throw an instant plan_err! for unparseable literals like int_col < 'hello' instead of deferring to a runtime CAST failure.

Expanded Coverage: Extend this logic beyond =/!= to handle <, >, <=, >=, and include Float types.

This two-phase approach gets the critical bug (#15161) fixed quickly, keeps the type coercion pure, and handles the performance/AST-awareness in the correct optimizer pass.

Happy to help with the follow-up PR. Thanks again for driving this fix!

neilconway · 2026-02-20T20:01:18Z

@AlonSpivack I wrote up an alternative approach to fixing it, and some comments back on the issue; it might be better to continue further discussion there since there's two PRs now.

datafusion/expr-common/src/type_coercion/binary.rs

datafusion/core/tests/sql/unparser.rs

cetra3

LGTM but obviously need to work out what to do about those clickbench queries

neilconway · 2026-03-05T21:23:20Z

LGTM but obviously need to work out what to do about those clickbench queries

Thanks for the review, @cetra3 !

neilconway · 2026-03-06T03:05:59Z

@alamb @adriangb @comphead FYI, I revised this PR and I think the current version is ready for review. Potentially a candidate for 53, if there's time.

adriangb

This looks good on my end, I agree the new rules area easier to reason about and more correct. You did a great job refactoring the code to keep it clear and understandabl.e Thank you for including good docstrings and code pointers.

This does require another couple of approvals before merging, it's a pretty big behavior change. Maybe we can have @comphead review from the Comet side and @milenkovicm review from the Ballista side?

Any thoughts from the vortex folks @AdamGS ?

datafusion/core/src/physical_planner.rs

adriangb · 2026-03-09T16:00:19Z

datafusion/expr-common/src/type_coercion/binary.rs

+/// # Numeric / String comparisons
+///
+/// Prefers the numeric type (e.g., `'2' > 1` where `1` is `Int32` coerces
+/// `'2'` to `Int32`).
+///
+/// For type unification contexts (UNION, CASE THEN/ELSE), use
+/// [`type_union_coercion`] instead.


Worth including a bit more here, e.g. rationale, why we have different behavior (matches Postgres, etc?)

Done, although I added the comment to type_union_coercion instead.

FWIW our behavior here does not match Postgres; for example, Postgres rejects queries like "SELECT 1 UNION SELECT 'a'", which we allow. I started going down the path of rejecting those queries in #20456, but that will break more user code, so I figured it would be best to defer for now. Another construct we allow that Postgres doesn't is text_col <> int_col -- again, I'd argue that is probably better to reject, but I didn't take that on as part of this PR.

AdamGS · 2026-03-09T16:51:33Z

@adriangb this seems reasonable, but so far we've tried to not do any type coercion internally, we've already had some issues with decimals, I suspect this might trigger more of these cases, and we'll need to do some rewriting to match the types.

milenkovicm · 2026-03-09T19:44:33Z

I'm not sure if i can give meaningful contribution in this case, ballista is relaying on datafusion for this behaviour.

would we need to capture change of behaviour somewhere, like upgrade guide ?

adriangb · 2026-03-09T19:57:25Z

I'm not sure if i can give meaningful contribution in this case, ballista is relaying on datafusion for this behaviour.

would we need to capture change of behaviour somewhere, like upgrade guide ?

Yes I think we should make an upgrading guide entry @neilconway.

If Ballista just uses DataFusion semantics and you are okay with this change then nothing to do on your end 😃

milenkovicm · 2026-03-09T20:07:40Z

Maybe sending an email to dev list for visibility would make sense, would it?

datafusion/expr-common/src/signature.rs

datafusion/expr/src/type_coercion/other.rs

Co-authored-by: Martin Grigorov <[email protected]>

neilconway · 2026-03-09T22:45:35Z

@adriangb @martin-g Thank you for the reviews!

I added a section to the upgrade guide for this change.

AlonSpivack mentioned this pull request Feb 19, 2026

Incorrect cast of integer columns to utf8 when comparing with utf8 constant #15161

Open

neilconway force-pushed the neilc/type-coercion-str-numeric branch from 9249b8d to 53734b1 Compare February 20, 2026 16:07

neilconway mentioned this pull request Feb 20, 2026

fix: Disallow implicit coercion between string and numeric types in binary operators #20456

Closed

neilconway mentioned this pull request Mar 4, 2026

Release DataFusion 53.0.0 (Feb 2026 / Mar 2026) #19692

Open

26 tasks

cetra3 reviewed Mar 5, 2026

View reviewed changes

datafusion/expr-common/src/type_coercion/binary.rs Outdated Show resolved Hide resolved

cetra3 reviewed Mar 5, 2026

View reviewed changes

datafusion/core/tests/sql/unparser.rs Outdated Show resolved Hide resolved

cetra3 reviewed Mar 5, 2026

View reviewed changes

Revised approach to fixing numeric/string type coercion

ba74cd2

neilconway force-pushed the neilc/type-coercion-str-numeric branch from 53734b1 to ba74cd2 Compare March 5, 2026 21:22

neilconway added 2 commits March 5, 2026 21:41

Revert map key coercion regression, fix other tests

4bf9fe4

cargo fmt

1cd2602

adriangb approved these changes Mar 9, 2026

View reviewed changes

martin-g reviewed Mar 9, 2026

View reviewed changes

neilconway and others added 5 commits March 9, 2026 15:17

More rationale for type union behavior

e2272f0

Fix comment typos

e52f506

Co-authored-by: Martin Grigorov <[email protected]>

Apply suggestion from @martin-g

8084010

Co-authored-by: Martin Grigorov <[email protected]>

Per review, explicitly test for IN list type mismatch error

7667bc1

Add upgrade notes for change

43657dd

github-actions bot added the documentation Improvements or additions to documentation label Mar 9, 2026

neilconway added 3 commits March 9, 2026 15:46

Fix docs prettier lint

36cd3df

Fix think in docs

e12bad2

Doc tweaks

6bdfc37

Conversation

neilconway commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

neilconway commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

neilconway commented Feb 19, 2026

Uh oh!

neilconway commented Feb 19, 2026

Uh oh!

AlonSpivack commented Feb 19, 2026

Uh oh!

neilconway commented Feb 19, 2026

Uh oh!

AlonSpivack commented Feb 19, 2026

Uh oh!

neilconway commented Feb 20, 2026

Uh oh!

Uh oh!

Uh oh!

cetra3 left a comment

Choose a reason for hiding this comment

Uh oh!

neilconway commented Mar 5, 2026

Uh oh!

neilconway commented Mar 6, 2026

Uh oh!

adriangb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

adriangb Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

neilconway Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

AdamGS commented Mar 9, 2026

Uh oh!

milenkovicm commented Mar 9, 2026

Uh oh!

adriangb commented Mar 9, 2026

Uh oh!

milenkovicm commented Mar 9, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

neilconway commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

neilconway commented Feb 19, 2026 •

edited

Loading

neilconway commented Feb 19, 2026 •

edited

Loading