fix: Set Substrait output types for expressions#20597
Conversation
|
@gabotechs Could you possibly take a look at this or suggest another reviewer? Thanks! |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
0fdbc92 to
6d0fbdc
Compare
|
@alamb Any chance you could help me find a reviewer for this one? (I'm not sure how I should go about this as a first-time DataFusion contributor). Many thanks! |
|
I @wlhjason -- thanks for the ping. Maybe @westonpace or @gabotechs have time to review this one (or know someone who does) |
|
Requesting backup now |
westonpace
left a comment
There was a problem hiding this comment.
This looks correct to me. The output type is not optional according to Substrait so the previous behavior was not compliant with the spec. This change brings things into compliance.
| producer: &mut impl SubstraitProducer, | ||
| name: &str, | ||
| args: &[Expr], | ||
| output_field: &Field, |
There was a problem hiding this comment.
Minor nit: although all the call-sites are able to come up with a Field easily enough all you really need here is a DataType so it might be nice to change this to DataType in case some future caller doesn't have a name handy.
There was a problem hiding this comment.
Changed to just require a DataType and nullability
| let expr = if *negated { | ||
| // `expr NOT BETWEEN low AND high` can be translated into (expr < low OR high < expr) | ||
| Expr::or( | ||
| Expr::lt(*expr.clone(), *low.clone()), | ||
| Expr::lt(*high.clone(), *expr.clone()), | ||
| ) |
There was a problem hiding this comment.
This change seems unrelated to the problem at hand. While I think it probably is correct that you can switch from BETWEEN to comparison operators I also think this is more of a plan rewrite than is normally done during Substrait conversion. Users might be surprised if they round-trip through Substrait and end up with a different plan than they started with.
There was a problem hiding this comment.
This is actually just a refactor of the existing logic to transform BETWEEN into a Boolean expression!
The previous implementation used 6 direct calls to make_binary_op_scalar_func. Rather than adding a data type and nullability to every one of these, I thought this was a good chance to vastly simplify the implementation.
| let all_conditions = join_on | ||
| .into_iter() | ||
| .map(|(left, right)| binary_expr(left, eq_op, right)) | ||
| .chain(join_filter); |
There was a problem hiding this comment.
This is clever and much more compact than the old impl, well done. I'm not 100% sure of the correctness but I assume the existing unit tests would catch it otherwise.
There was a problem hiding this comment.
I've added a test for from_join in edfd835 which passes before and after the refactor
There was a problem hiding this comment.
Regarding this comment from Claude:
from_join behavioral change is observable on the wire
I don't think the behavior has actually changed here, as the new test shows - we still get (join_cond_1 AND join_cond_2) AND join_filter rather than join_cond_1 AND (join_cond_2 AND join_filter).
There was a problem hiding this comment.
PS: The simplification of this code via conjunction was inspired by @alamb's suggestion on the original PR: #7612 (comment) 🙏
| use substrait::proto::{Expression, JoinRel, Rel, join_rel}; | ||
| use substrait::proto::{JoinRel, Rel, join_rel}; | ||
|
|
||
| pub fn from_join( |
There was a problem hiding this comment.
Can you add a unit test for regression purposes ensuring that the join expression has an output type?
|
Here's some feedback from Claude: I don't think I'm two worried about P1. The plans are still roughly equivalent. Might be worth just calling out in the PR description. I agree with P2, in particular, we should have a test for the join case. I think P3 is inevitable and somewhat accepted at the moment. Agree that the things in P4 are minor, take them or leave them. |
876cf5f to
d560e09
Compare
|
Many thanks for the review, @westonpace! I think I've addressed all the key points now. |
Agreed. I don't have merge privilege but maybe @alamb can pull the trigger? |
Which issue does this PR close?
BinaryExprincludesoutput_type#15831.Rationale for this change
The Substrait producer did not set the ScalarFunction
output_typewhen converting binary expressions, which broke consumers relying on theoutput_type.What changes are included in this PR?
from_joinandfrom_betweento eliminate direct calls tomake_binary_op_scalar_funcoutput_typewhen converting several types of DataFusion expressions:Expr::BinaryExpr)Expr::Not)Expr::ScalarFunction)There are a few more places where the
output_typehas not been set, such asfrom_likeandfrom_in_list, as mentioned in #15831. I've left these out of scope here as fixing them would require more substantial code changes.Are these changes tested?
Yes, via a new unit test.
Are there any user-facing changes?
No, beyond the Substrait output fix.