Skip to content

fix: reduce memory allocation overhead during partial aggregation ear…#22165

Open
ariel-miculas wants to merge 1 commit into
apache:mainfrom
ariel-miculas:fix-extra-memory-allocated-in-take-n
Open

fix: reduce memory allocation overhead during partial aggregation ear…#22165
ariel-miculas wants to merge 1 commit into
apache:mainfrom
ariel-miculas:fix-extra-memory-allocated-in-take-n

Conversation

@ariel-miculas
Copy link
Copy Markdown
Contributor

…ly emit

Which issue does this PR close?

Rationale for this change

When the partial hash aggregation (with GroupOrdering::None) cannot reserve additional memory, it calls self.emit(EmitTo::First(n), where n is the largest integer multiple of self.batch_size, meaning it's close to the length of group_values. Using drain + collect leads to an additional copy of all these n values, since collect allocates a new Vec and drain doesn't modify the original vector's capacity. To avoid copying the largest allocation, choose a strategy which always allocates the minimum between n and remaining.

What changes are included in this PR?

Are these changes tested?

Yes, added unit tests

Are there any user-facing changes?

No

…ly emit

When the partial hash aggregation (with GroupOrdering::None) cannot
reserve additional memory, it calls self.emit(EmitTo::First(n), where n
is the largest integer multiple of self.batch_size, meaning it's close
to the length of group_values.  Using drain + collect leads to an
additional copy of all these n values, since collect allocates a new Vec
and drain doesn't modify the original vector's capacity. To avoid
copying the largest allocation, choose a strategy which always allocates
the minimum between n and remaining.

Fixes apache#22164
@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label May 14, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- thanks @ariel-miculas

@alamb alamb added the performance Make DataFusion faster label May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Make DataFusion faster physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extra memory allocated during partial aggregation early emit during OOM handling

2 participants