Skip to content

fix: Set Substrait output types for expressions#20597

Open
wlhjason wants to merge 7 commits into
apache:mainfrom
wlhjason:GH-15831-substrait-output-types
Open

fix: Set Substrait output types for expressions#20597
wlhjason wants to merge 7 commits into
apache:mainfrom
wlhjason:GH-15831-substrait-output-types

Conversation

@wlhjason
Copy link
Copy Markdown

Which issue does this PR close?

Rationale for this change

The Substrait producer did not set the ScalarFunction output_type when converting binary expressions, which broke consumers relying on the output_type.

What changes are included in this PR?

  • Refactor from_join and from_between to eliminate direct calls to make_binary_op_scalar_func
  • Set the Substrait ScalarFunction output_type when converting several types of DataFusion expressions:
    • Binary expressions (Expr::BinaryExpr)
    • Unary expressions (like Expr::Not)
    • Scalar functions (Expr::ScalarFunction)

There are a few more places where the output_type has not been set, such as from_like and from_in_list, as mentioned in #15831. I've left these out of scope here as fixing them would require more substantial code changes.

Are these changes tested?

Yes, via a new unit test.

Are there any user-facing changes?

No, beyond the Substrait output fix.

@github-actions github-actions Bot added the substrait Changes to the substrait crate label Feb 27, 2026
@wlhjason
Copy link
Copy Markdown
Author

@gabotechs Could you possibly take a look at this or suggest another reviewer? Thanks!

@github-actions
Copy link
Copy Markdown

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions Bot added the Stale PR has not had any activity for some time label May 10, 2026
@wlhjason wlhjason force-pushed the GH-15831-substrait-output-types branch 2 times, most recently from 0fdbc92 to 6d0fbdc Compare May 16, 2026 18:04
@wlhjason
Copy link
Copy Markdown
Author

@alamb Any chance you could help me find a reviewer for this one? (I'm not sure how I should go about this as a first-time DataFusion contributor). Many thanks!

@github-actions github-actions Bot removed the Stale PR has not had any activity for some time label May 17, 2026
@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 18, 2026

I @wlhjason -- thanks for the ping.

Maybe @westonpace or @gabotechs have time to review this one (or know someone who does)

@gabotechs
Copy link
Copy Markdown
Contributor

Requesting backup now

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct to me. The output type is not optional according to Substrait so the previous behavior was not compliant with the spec. This change brings things into compliance.

producer: &mut impl SubstraitProducer,
name: &str,
args: &[Expr],
output_field: &Field,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: although all the call-sites are able to come up with a Field easily enough all you really need here is a DataType so it might be nice to change this to DataType in case some future caller doesn't have a name handy.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to just require a DataType and nullability

Comment on lines +301 to +306
let expr = if *negated {
// `expr NOT BETWEEN low AND high` can be translated into (expr < low OR high < expr)
Expr::or(
Expr::lt(*expr.clone(), *low.clone()),
Expr::lt(*high.clone(), *expr.clone()),
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems unrelated to the problem at hand. While I think it probably is correct that you can switch from BETWEEN to comparison operators I also think this is more of a plan rewrite than is normally done during Substrait conversion. Users might be surprised if they round-trip through Substrait and end up with a different plan than they started with.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually just a refactor of the existing logic to transform BETWEEN into a Boolean expression!

The previous implementation used 6 direct calls to make_binary_op_scalar_func. Rather than adding a data type and nullability to every one of these, I thought this was a good chance to vastly simplify the implementation.

Comment on lines +75 to +78
let all_conditions = join_on
.into_iter()
.map(|(left, right)| binary_expr(left, eq_op, right))
.chain(join_filter);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is clever and much more compact than the old impl, well done. I'm not 100% sure of the correctness but I assume the existing unit tests would catch it otherwise.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test for from_join in edfd835 which passes before and after the refactor

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this comment from Claude:

from_join behavioral change is observable on the wire

I don't think the behavior has actually changed here, as the new test shows - we still get (join_cond_1 AND join_cond_2) AND join_filter rather than join_cond_1 AND (join_cond_2 AND join_filter).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: The simplification of this code via conjunction was inspired by @alamb's suggestion on the original PR: #7612 (comment) 🙏

use substrait::proto::{Expression, JoinRel, Rel, join_rel};
use substrait::proto::{JoinRel, Rel, join_rel};

pub fn from_join(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a unit test for regression purposes ensuring that the join expression has an output type?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in e2cb497

@westonpace
Copy link
Copy Markdown
Member

Here's some feedback from Claude:

  P1 — from_join behavioral change is observable on the wire

  The old code combined on and filter as AND(reduce_left(on_conditions), filter). The new code uses conjunction(on_conditions.chain(filter)), which folds right in DataFusion's conjunction
  helper (it uses Iterator::reduce and Expr::and, but the associativity of the resulting tree may differ). Semantically equivalent, but the emitted Substrait byte representation will change
  for joins with 2+ ON predicates plus a filter. Worth:
  - Calling this out in the PR description.
  - Adding a join-with-filter test (or sqllogictest) that exercises both on and filter simultaneously and asserts the produced plan, so future refactors don't silently change wire format
  again.

  P2 — Test coverage is thin

  Only one new test, covering the simplest case (int64 + int64). Given the scope (binary, unary, scalar, higher-order, join refactor, between refactor), consider adding:
  - Unary case (Expr::Not(...) → Boolean, nullable matches input).
  - A nullable-vs-non-nullable case to confirm is_nullable() propagates into the Substrait type's nullability.
  - A from_between test that round-trips both BETWEEN and NOT BETWEEN — the refactor here is non-trivial and lacks direct coverage.
  - A from_join test asserting the join expression structure after the refactor.

  P3 — Expr::to_field(schema)? cost

  Expr::to_field recursively re-derives types from the schema. With this change, every from_binary_expr call now does a to_field over the whole subexpression and recurses into each child,
  which itself re-runs to_field on its subtree. That makes typing O(depth × nodes) over the expression tree. For typical SQL this is unnoticeable, but for generated/synthetic deep
  expressions it can be a quadratic blow-up. Not a blocker, but worth knowing — caching (data_type, nullable) on the way down (or accepting an already-computed type from the caller) would
  fix it.

  P4 — Minor cleanups

  - Expr::ScalarFunction(fun.clone()).to_field(schema)? and Expr::BinaryExpr(expr.clone()).to_field(schema)? clone the expression solely to call a trait method on &Expr. If ExprSchemable
  exposed equivalents on the inner types (ScalarFunction, BinaryExpr), the clones could be avoided. Not introduced by this PR, but the new code makes them more prevalent.
  - In from_between, the *expr.clone() pattern occurs four times in the BETWEEN-not-negated branch:
  Expr::and(
      Expr::lt_eq(*low.clone(), *expr.clone()),
      Expr::lt_eq(*expr.clone(), *high.clone()),
  )
  - A let expr_value = (*expr).clone(); once at the top would tighten this.
  - The new test is #[tokio::test] but the body is fully synchronous — #[test] is sufficient.
  - to_substrait_type(producer, output_field.data_type(), output_field.is_nullable()) is called 3 times. The existing to_substrait_type_from_field would be a slightly cleaner call (after
  wrapping Field → FieldRef). Pure nit.

I don't think I'm two worried about P1. The plans are still roughly equivalent. Might be worth just calling out in the PR description.

I agree with P2, in particular, we should have a test for the join case.

I think P3 is inevitable and somewhat accepted at the moment.

Agree that the things in P4 are minor, take them or leave them.

@wlhjason wlhjason force-pushed the GH-15831-substrait-output-types branch from 876cf5f to d560e09 Compare May 23, 2026 21:41
@wlhjason
Copy link
Copy Markdown
Author

Many thanks for the review, @westonpace! I think I've addressed all the key points now.

@westonpace
Copy link
Copy Markdown
Member

I think I've addressed all the key points now.

Agreed. I don't have merge privilege but maybe @alamb can pull the trigger?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ensure Substrait producer for BinaryExpr includes output_type

4 participants