fix: Prefer numeric in type coercion for comparisons#20426
fix: Prefer numeric in type coercion for comparisons#20426neilconway wants to merge 11 commits intoapache:mainfrom
Conversation
|
TODOs:
|
|
One example of a weird behavior that we get with the new coercion rules proposed here: This follows from the "coerce string to numeric" change in general, but it's a bit weird. I think we should probably reject queries like this entirely, on the basis of a type mismatch. If we did that and nothing else, we'd break a ton of queries (e.g., anything comparing a numeric column with a string literal). But we could fix that up by treating comparisons with string literals specially (e.g., "if x OP y mismatch in types, check if either x or y is a string literal; if so, try to coerce it to a numeric type"). So that's an alternative approach worth considering. |
|
The ClickBench queries are just weird: they have expressions like |
|
Thanks for putting this PR together @neilconway! Regarding your dilemma about |
|
@AlonSpivack The challenge with rejecting If we want to reject
(1) is probably more principled, but I think it's too big of a change to undertake on short notice. (2) sounds plausible to me at first blush, but it needs more consideration. |
|
Hey @neilconway , Regarding the bad query plans you mentioned (like CAST(text_col AS Int64) < 5) and the ClickBench issues: I think we should tackle this in two separate PRs to respect the project's architecture: PR 1 (This PR - Type Level): Your current approach (coercing string→numeric for comparisons) is correct at the type-abstraction level. Generating a CAST here is the right logical step to ensure mathematical correctness. As @alamb noted in the past (e.g., on PR #15482), type coercion rules should be based strictly on (DataType, DataType) pairs, not on checking for Expr::Literal. So mixing literal evaluation into this specific phase would be architecturally incorrect. PR 2 (Follow-up - Optimizer Level): Once this PR lands, I'd like to submit a follow-up PR targeting the simplifier (specifically extending cast_literal_to_type_with_op in unwrap_cast.rs). Since the simplifier is expected to be expression-aware, we can safely intercept the CAST there: Early Literal Folding: Fold Literal(Utf8("10")) → Literal(Int64(10)) at planning time when compared against a numeric column. This gives zero runtime cost and restores performance/predicate pushdown. Fail-Fast for Invalid Literals: Throw an instant plan_err! for unparseable literals like int_col < 'hello' instead of deferring to a runtime CAST failure. Expanded Coverage: Extend this logic beyond =/!= to handle <, >, <=, >=, and include Float types. This two-phase approach gets the critical bug (#15161) fixed quickly, keeps the type coercion pure, and handles the performance/AST-awareness in the correct optimizer pass. Happy to help with the follow-up PR. Thanks again for driving this fix! |
9249b8d to
53734b1
Compare
|
@AlonSpivack I wrote up an alternative approach to fixing it, and some comments back on the issue; it might be better to continue further discussion there since there's two PRs now. |
cetra3
left a comment
There was a problem hiding this comment.
LGTM but obviously need to work out what to do about those clickbench queries
53734b1 to
ba74cd2
Compare
Thanks for the review, @cetra3 ! |
adriangb
left a comment
There was a problem hiding this comment.
This looks good on my end, I agree the new rules area easier to reason about and more correct. You did a great job refactoring the code to keep it clear and understandabl.e Thank you for including good docstrings and code pointers.
This does require another couple of approvals before merging, it's a pretty big behavior change. Maybe we can have @comphead review from the Comet side and @milenkovicm review from the Ballista side?
Any thoughts from the vortex folks @AdamGS ?
| /// # Numeric / String comparisons | ||
| /// | ||
| /// Prefers the numeric type (e.g., `'2' > 1` where `1` is `Int32` coerces | ||
| /// `'2'` to `Int32`). | ||
| /// | ||
| /// For type unification contexts (UNION, CASE THEN/ELSE), use | ||
| /// [`type_union_coercion`] instead. |
There was a problem hiding this comment.
Worth including a bit more here, e.g. rationale, why we have different behavior (matches Postgres, etc?)
There was a problem hiding this comment.
Done, although I added the comment to type_union_coercion instead.
FWIW our behavior here does not match Postgres; for example, Postgres rejects queries like "SELECT 1 UNION SELECT 'a'", which we allow. I started going down the path of rejecting those queries in #20456, but that will break more user code, so I figured it would be best to defer for now. Another construct we allow that Postgres doesn't is text_col <> int_col -- again, I'd argue that is probably better to reject, but I didn't take that on as part of this PR.
|
@adriangb this seems reasonable, but so far we've tried to not do any type coercion internally, we've already had some issues with decimals, I suspect this might trigger more of these cases, and we'll need to do some rewriting to match the types. |
|
I'm not sure if i can give meaningful contribution in this case, ballista is relaying on datafusion for this behaviour. would we need to capture change of behaviour somewhere, like upgrade guide ? |
Yes I think we should make an upgrading guide entry @neilconway. If Ballista just uses DataFusion semantics and you are okay with this change then nothing to do on your end 😃 |
|
Maybe sending an email to dev list for visibility would make sense, would it? |
Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Martin Grigorov <[email protected]>
Which issue does this PR close?
Rationale for this change
In a comparison between a numeric column and a string literal (e.g.,
WHERE int_col < '10'), we previously coerced the numeric column to be a string type. This resulted in doing a lexicographic comparison, which results in incorrect query results.Instead, we split type coercion into two situations: type coercion for comparisons (including
INlists,BETWEEN, andCASE WHEN), where we want string->numeric coercion, and type coercion for places likeUNIONorCASE ... THEN/ELSE, where DataFusion's traditional behavior has been to tolerate type mismatching by coercing values to strings.Here is a (not necessarily exhaustive) summary of the behavioral changes (old -> new):
What changes are included in this PR?
comparison_coercionto coerce strings to numericscomparison_coercion_numericfunctiontype_union_coercion, and use it when appropriateAre these changes tested?
Yes. New tests added, existing tests pass.
Are there any user-facing changes?
Yes, see table above. In most cases the new behavior should be more sensible and less error-prone, but it will likely break some user code.