Simplify get_field over inline struct constructors#22239
Conversation
Add a logical simplifier rule that resolves field access against inline
`named_struct(...)` and `struct(...)` constructors at plan time, so a
struct that is built and immediately read back never needs to be
materialized:
* `get_field(named_struct('min', a, 'max', b), 'max')` => `b`
* `get_field(struct(a, b), 'c1')` => `b`
* nested constructors collapse all the way through
The rewrite lives in `GetFieldFunc::simplify` (the same hook that already
flattens nested `get_field` calls) and bails out whenever it cannot be
proven safe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| }; | ||
| assert_eq!(simplified(args), col("b")); | ||
| } | ||
|
|
There was a problem hiding this comment.
Small readability nit: this test reconstructs get_field(...), immediately destructures it back into args, and then calls simplified(args). The neighboring tests already pass args directly, which feels a bit easier to follow. It may be worth switching this one to the same style for consistency.
There was a problem hiding this comment.
Done in 11a390e — the test now builds args directly and passes it to simplified(args), matching the neighboring tests.
| Arc::new(ScalarUDF::new_from_impl(GetFieldFunc::new())), | ||
| new_args, | ||
| ))) | ||
| } |
There was a problem hiding this comment.
The re-wrap path creates a fresh Arc::new(ScalarUDF::new_from_impl(GetFieldFunc::new())) each time it builds an intermediate get_field node. Since the same construction already appears elsewhere in simplify, it might be nice to centralize this behind a small helper like get_field_udf() or a shared OnceLock. Not a big deal performance-wise, but it would make the intent and reuse a bit clearer.
There was a problem hiding this comment.
Done in cd20ebb — added a get_field_udf() helper backed by a OnceLock and routed both construction sites (the re-wrap path here and the flatten path in simplify) through it.
| .try_as_str() | ||
| .flatten() | ||
| .filter(|s| !s.is_empty())?; | ||
|
|
There was a problem hiding this comment.
try_as_str().flatten() correctly handles ScalarValue::Utf8(None) here, so the guard is safe. It could still be nice to add a dedicated test for a null literal field name, something like simplify_get_field_null_field_name_left_alone, just to document the invariant explicitly.
There was a problem hiding this comment.
Done in b2e0978 — added simplify_get_field_null_field_name_left_alone covering the ScalarValue::Utf8(None) case, asserting the expression is left unchanged.
| 02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/data/composite_order.csv]]}, projection=[named_struct(a, a@0, b, b@1) as s], file_type=csv, has_header=true | ||
|
|
||
| # Simple column ordering tests using a table ordered by (a) | ||
| statement ok |
There was a problem hiding this comment.
One thing that stood out in the updated plan is that the sort key now evaluates get_field(s@0, a) against the materialized struct instead of sorting directly on an extracted column. Functionally this looks correct, but it may introduce some extra per-row sort-key evaluation work compared with the previous shape. Might be worth tracking in a follow-up issue whether physical sort-key normalization could simplify get_field(named_struct(...), f) back into a direct column reference.
There was a problem hiding this comment.
Good observation. Filed a tracking issue for the physical sort-key normalization opportunity: #22487. As noted there, this is a tradeoff rather than a clear regression (sort keys are materialized into an array once before sorting, and the new shape drops the extra extracted column + recovery projection), so the issue calls for a benchmark before investing in the rewrite.
|
|
||
| impl GetFieldFunc { | ||
| pub fn new() -> Self { | ||
| Self { |
There was a problem hiding this comment.
This simplification can eliminate evaluation of struct fields that are never accessed, for example get_field(named_struct('a', 1/0, 'b', x), 'b') simplifying down to x. That matches the existing immutable-expression optimizer contract, but I think a short doc comment here would help future readers understand that unused field expressions are intentionally not evaluated after simplification.
There was a problem hiding this comment.
Done in 4f70d80 — added a paragraph to the function doc comment noting that replacing the access drops the other fields' expressions so they are no longer evaluated (e.g. get_field(named_struct('a', 1/0, 'b', x), 'b') => x), which matches the optimizer's immutable-expression contract.
| if format!("c{index}") != field_name { | ||
| return None; | ||
| } | ||
| ctor_args.get(index)?.clone() |
There was a problem hiding this comment.
The early return None on a dynamic field name makes sense because a later runtime name could shadow an earlier literal match. One subtle detail is that this also prevents simplification even when a matching literal field was already seen earlier in the loop. It may be worth calling out in the comment that this is a deliberate conservative approximation rather than an accidental missed optimization.
There was a problem hiding this comment.
Done — expanded the comment to call out that bailing on a non-literal field name is a deliberate conservative approximation rather than an accidental missed optimization.
To be precise about the rationale: a non-literal name appearing after the first literal match can't change the result, since column_by_name returns the first match. The bail is only strictly required when the non-literal name appears before the first match (it could shadow it at runtime); applying it unconditionally is the conservative-for-simplicity choice. Reworded the comment accordingly in f799333 (force-pushed; new branch tip b2e0978).
|
thank you @kosiew . i will go through the feedback next week! |
Both the re-wrap path in simplify_get_field_over_struct_constructor and the flatten path in GetFieldFunc::simplify built a fresh Arc::new(ScalarUDF::new_from_impl(GetFieldFunc::new())) each time. Share a single OnceLock-backed instance via get_field_udf() to make the reuse and intent clearer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Document on simplify_get_field_over_struct_constructor that replacing the
access with the selected field drops the expressions for the other fields,
so they are no longer evaluated (e.g. get_field(named_struct('a', 1/0, 'b',
x), 'b') => x). This matches the optimizer's immutable-expression contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite the comment in the named_struct branch to state the rationale accurately. A non-literal name appearing *before* the first match could evaluate to field_name at runtime and become the real first match, so we must bail. But once a literal match has been found, a later non-literal name can never precede it (column_by_name returns the first match), so bailing there is a deliberate approximation we accept for simplicity, not a correctness requirement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
simplify_get_field_named_struct_returns_matching_value built a get_field expr only to immediately destructure it back into args. Pass args directly, matching the style of the neighboring tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add simplify_get_field_null_field_name_left_alone to explicitly cover the ScalarValue::Utf8(None) case, where try_as_str().flatten() yields no field name and the expression must be left unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
52e7567 to
b2e0978
Compare
|
@kosiew I think I've addressed the feedback, ready for another round of review 🙏🏻 |
Which issue does this PR close?
#22240
Rationale for this change
Constructing a struct with
named_struct(...)(orstruct(...)) and then immediately reading a field back out of it is pure overhead — the intermediate struct never needs to be materialized. This pattern shows up after view/CTE inlining and projection pushdown, where anamed_structprojection feeds aget_fieldin a parent node.For example:
get_field(named_struct('type', type, 'value', value), 'type')is equivalent totype, so the simplifier can drop the struct entirely. This is especially important because without this simplification no statistics pruning can be applied.What changes are included in this PR?
A new logical simplification, added to
GetFieldFunc::simplify(the same hook that already flattens nestedget_fieldcalls):get_field(named_struct('min', a, 'max', b), 'max')=>b(lookup by name)get_field(struct(a, b), 'c1')=>b(positionalc0,c1, ... fields)named_struct('outer', named_struct('inner', a))['outer']['inner']=>aThe rewrite is conservative and bails out (leaving the expression untouched) whenever it cannot be proven safe:
named_structwith a non-literal field name (which could shadow the requested field at runtime),structfield spellings such asc01.Casts are intentionally not unwrapped: a struct→struct cast can rename, retype and reorder fields, so resolving through one correctly is a larger, separate change.
Are these changes tested?
Yes:
getfield.rscovering matches, duplicate names, nested constructors, flatten-then-resolve, and every bail-out guard.struct.slt: query +EXPLAINtests showing the field access collapses to the underlying column.order.slt: twoEXPLAINexpectations updated — resolvingget_field(named_struct(...), 'a')lets theextract_leaf_expressionsrule skip a now-pointless sort-key extraction. TheSortExecis still present, as those tests intend, and the sort-elimination cases are unchanged.The full
sqllogictestsuite, the optimizer crate tests,cargo clippy --all-targets --all-features -D warningsandcargo fmtall pass.Are there any user-facing changes?
No API changes. Query plans involving
get_fieldover an inline struct constructor are simpler; results are unchanged.🤖 Generated with Claude Code