Skip to content

fix: Prefer numeric in type coercion for comparisons#20426

Open
neilconway wants to merge 11 commits intoapache:mainfrom
neilconway:neilc/type-coercion-str-numeric
Open

fix: Prefer numeric in type coercion for comparisons#20426
neilconway wants to merge 11 commits intoapache:mainfrom
neilconway:neilc/type-coercion-str-numeric

Conversation

@neilconway
Copy link
Contributor

@neilconway neilconway commented Feb 19, 2026

Which issue does this PR close?

Rationale for this change

In a comparison between a numeric column and a string literal (e.g., WHERE int_col < '10'), we previously coerced the numeric column to be a string type. This resulted in doing a lexicographic comparison, which results in incorrect query results.

Instead, we split type coercion into two situations: type coercion for comparisons (including IN lists, BETWEEN, and CASE WHEN), where we want string->numeric coercion, and type coercion for places like UNION or CASE ... THEN/ELSE, where DataFusion's traditional behavior has been to tolerate type mismatching by coercing values to strings.

Here is a (not necessarily exhaustive) summary of the behavioral changes (old -> new):

  Comparisons (=, <, >, etc.):
    float_col = '5'      : string (wrong: '5'!='5.0')  -> numeric
    int_col > '100'      : string (wrong: '325'<'100') -> numeric
    int_col = 'hello'    : string, always false        -> cast error
    int_col = ''         : string, always false        -> cast error
    int_col = '99.99'    : string, always false        -> cast error
    Dict(Int) = '5'      : string                      -> numeric
    REE(Int) = '5'       : string                      -> numeric
    struct(int)=struct(str): int field to Utf8  -> str field to int

  IN lists:
    float_col IN ('1.0') : string (wrong: '1.0'!='1') -> numeric
    str_col IN ('a', 1)  : coerce to Utf8       -> coerce to Int64

  CASE:
    CASE str WHEN float  : coerce to Utf8       -> coerce to Float

  LIKE / regex:
    Dict(Int) LIKE '%5%' : coerce to Utf8       -> error (matches int)
    REE(Int) LIKE '%5%'  : coerce to Utf8       -> error (matches int)
    Dict(Int) ~ '5'      : coerce to Utf8       -> error (matches int)
    REE(Int) ~ '5'       : error (no REE)       -> error (REE added)
    REE(Utf8) ~ '5'      : error (no REE)       -> works (REE added)

What changes are included in this PR?

  • Update comparison_coercion to coerce strings to numerics
  • Remove previous comparison_coercion_numeric function
  • Add a new function, type_union_coercion, and use it when appropriate
  • Add support for REE types with regexp operators (this was unsupported for no good reason I can see)
  • Add unit and SLT tests for new coercion behavior
  • Update existing SLT tests for changes in coercion behavior
  • Fix the ClickBench unparser tests to avoid comparing int fields with non-numeric string literals

Are these changes tested?

Yes. New tests added, existing tests pass.

Are there any user-facing changes?

Yes, see table above. In most cases the new behavior should be more sensible and less error-prone, but it will likely break some user code.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate functions Changes to functions implementation labels Feb 19, 2026
@neilconway
Copy link
Contributor Author

neilconway commented Feb 19, 2026

TODOs:

  • ClickBench queries are broken (see below)
  • We might be able to remove some/all of the special-case logic in cast_literal_to_type_with_op (see perf: unwrap cast for comparing ints =/!= strings #15110)
  • Decide whether type union coercion is well-motivated, or if we want to remove it (keep)

@neilconway
Copy link
Contributor Author

One example of a weird behavior that we get with the new coercion rules proposed here:

> create table t1 (a text, b int, c float);
0 row(s) fetched.
Elapsed 0.018 seconds.

> explain select * from t1 where a < 5;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │    CAST(a AS Int64) < 5   │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │          bytes: 0         │ |
|               | │       format: memory      │ |
|               | │          rows: 0          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.016 seconds.

This follows from the "coerce string to numeric" change in general, but it's a bit weird. I think we should probably reject queries like this entirely, on the basis of a type mismatch.

If we did that and nothing else, we'd break a ton of queries (e.g., anything comparing a numeric column with a string literal). But we could fix that up by treating comparisons with string literals specially (e.g., "if x OP y mismatch in types, check if either x or y is a string literal; if so, try to coerce it to a numeric type"). So that's an alternative approach worth considering.

@neilconway
Copy link
Contributor Author

The ClickBench queries are just weird: they have expressions like "EventDate" >= '2013-07-01', where EventDate is a uint16. Previously, EventDate would be cast to a string and the comparison would have been done lexicographically, which is wrong but didn't fail. Now we'll try to coerce the string literal to a numeric, which fails. So this is intended behavior -- the ClickBench queries need fixing up some other way.

@AlonSpivack
Copy link

Thanks for putting this PR together @neilconway!
Splitting out type_union_coercion is definitely the right architectural move.

Regarding your dilemma about text_col < 5:
I strongly agree with your intuition that we should reject these queries entirely with a clear type-mismatch/planning error, rather than trying to add special-case fallback logic. If a user actually wants to compare a string column to a number, they should be required to add an explicit CAST.

@neilconway
Copy link
Contributor Author

@AlonSpivack The challenge with rejecting text_col < 5 under this approach is that in comparison_coercion, we've already lost the fact that we're comparing a column with a literal. All we see is a comparison between a string-typed value and a numeric-typed value, which this approach would allow.

If we want to reject text_col < 5 and similar expressions, I think we need to remove the comparison_coercion logic for coercing string -> numeric and the like. We'd then have two options:

  1. Introduce an "unknown" type. All string literals would have unknown type, and then would be coerced into a concrete type depending on the context in which they are used. This is roughly what Postgres does.
  2. Don't introduce a distinct type, but just add some (perhaps ad-hoc) rules for coercing binary expressions where exactly one operand is a string literal and the other operand is a numeric-typed column reference.

(1) is probably more principled, but I think it's too big of a change to undertake on short notice.

(2) sounds plausible to me at first blush, but it needs more consideration.

@AlonSpivack
Copy link

Hey @neilconway ,
thanks for moving so quickly on this. Splitting the coercion rules between comparison_coercion and type_union_coercion is definitely the right architectural foundation, and it aligns with what @alamb suggested previously about context-aware coercion.

Regarding the bad query plans you mentioned (like CAST(text_col AS Int64) < 5) and the ClickBench issues:

I think we should tackle this in two separate PRs to respect the project's architecture:

PR 1 (This PR - Type Level): Your current approach (coercing string→numeric for comparisons) is correct at the type-abstraction level. Generating a CAST here is the right logical step to ensure mathematical correctness. As @alamb noted in the past (e.g., on PR #15482), type coercion rules should be based strictly on (DataType, DataType) pairs, not on checking for Expr::Literal. So mixing literal evaluation into this specific phase would be architecturally incorrect.

PR 2 (Follow-up - Optimizer Level): Once this PR lands, I'd like to submit a follow-up PR targeting the simplifier (specifically extending cast_literal_to_type_with_op in unwrap_cast.rs). Since the simplifier is expected to be expression-aware, we can safely intercept the CAST there:

Early Literal Folding: Fold Literal(Utf8("10")) → Literal(Int64(10)) at planning time when compared against a numeric column. This gives zero runtime cost and restores performance/predicate pushdown.

Fail-Fast for Invalid Literals: Throw an instant plan_err! for unparseable literals like int_col < 'hello' instead of deferring to a runtime CAST failure.

Expanded Coverage: Extend this logic beyond =/!= to handle <, >, <=, >=, and include Float types.

This two-phase approach gets the critical bug (#15161) fixed quickly, keeps the type coercion pure, and handles the performance/AST-awareness in the correct optimizer pass.

Happy to help with the follow-up PR. Thanks again for driving this fix!

@neilconway
Copy link
Contributor Author

@AlonSpivack I wrote up an alternative approach to fixing it, and some comments back on the issue; it might be better to continue further discussion there since there's two PRs now.

Copy link
Contributor

@cetra3 cetra3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but obviously need to work out what to do about those clickbench queries

@neilconway neilconway force-pushed the neilc/type-coercion-str-numeric branch from 53734b1 to ba74cd2 Compare March 5, 2026 21:22
@neilconway
Copy link
Contributor Author

LGTM but obviously need to work out what to do about those clickbench queries

Thanks for the review, @cetra3 !

@neilconway
Copy link
Contributor Author

@alamb @adriangb @comphead FYI, I revised this PR and I think the current version is ready for review. Potentially a candidate for 53, if there's time.

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good on my end, I agree the new rules area easier to reason about and more correct. You did a great job refactoring the code to keep it clear and understandabl.e Thank you for including good docstrings and code pointers.

This does require another couple of approvals before merging, it's a pretty big behavior change. Maybe we can have @comphead review from the Comet side and @milenkovicm review from the Ballista side?

Any thoughts from the vortex folks @AdamGS ?

Comment on lines +884 to +890
/// # Numeric / String comparisons
///
/// Prefers the numeric type (e.g., `'2' > 1` where `1` is `Int32` coerces
/// `'2'` to `Int32`).
///
/// For type unification contexts (UNION, CASE THEN/ELSE), use
/// [`type_union_coercion`] instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth including a bit more here, e.g. rationale, why we have different behavior (matches Postgres, etc?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, although I added the comment to type_union_coercion instead.

FWIW our behavior here does not match Postgres; for example, Postgres rejects queries like "SELECT 1 UNION SELECT 'a'", which we allow. I started going down the path of rejecting those queries in #20456, but that will break more user code, so I figured it would be best to defer for now. Another construct we allow that Postgres doesn't is text_col <> int_col -- again, I'd argue that is probably better to reject, but I didn't take that on as part of this PR.

@AdamGS
Copy link
Contributor

AdamGS commented Mar 9, 2026

@adriangb this seems reasonable, but so far we've tried to not do any type coercion internally, we've already had some issues with decimals, I suspect this might trigger more of these cases, and we'll need to do some rewriting to match the types.

@milenkovicm
Copy link
Contributor

I'm not sure if i can give meaningful contribution in this case, ballista is relaying on datafusion for this behaviour.

would we need to capture change of behaviour somewhere, like upgrade guide ?

@adriangb
Copy link
Contributor

adriangb commented Mar 9, 2026

I'm not sure if i can give meaningful contribution in this case, ballista is relaying on datafusion for this behaviour.

would we need to capture change of behaviour somewhere, like upgrade guide ?

Yes I think we should make an upgrading guide entry @neilconway.

If Ballista just uses DataFusion semantics and you are okay with this change then nothing to do on your end 😃

@milenkovicm
Copy link
Contributor

Maybe sending an email to dev list for visibility would make sense, would it?

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 9, 2026
@neilconway
Copy link
Contributor Author

@adriangb @martin-g Thank you for the reviews!

I added a section to the upgrade guide for this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate documentation Improvements or additions to documentation functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect cast of integer columns to utf8 when comparing with utf8 constant

7 participants