⚡ Bolt: Add fast path for single-batch build side in HashJoinExec#255
google-labs-jules[bot] wants to merge 1 commit into main
This optimization adds a fast path to `collect_left_input` for the common case where the build side of a hash join consists of a single `RecordBatch`. By handling this case separately, it avoids the expensive `concat_batches` operation, which allocates new memory and copies data. This results in reduced memory allocation and CPU usage for single-batch build sides.
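The idea can be illustrated with a std-only sketch. This is not the DataFusion code: `Batch` is a hypothetical stand-in for Arrow's `RecordBatch`, `collect_build_side` is a hypothetical stand-in for `collect_left_input`, and the local `concat_batches` only imitates the allocate-and-copy behaviour of the real function.

```rust
/// Hypothetical stand-in for an Arrow `RecordBatch`: one column of join keys.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    keys: Vec<i64>,
}

/// Imitates the cost profile of `concat_batches`: allocates a new buffer
/// and copies every row of every input batch into it.
fn concat_batches(batches: &[Batch]) -> Batch {
    let total: usize = batches.iter().map(|b| b.keys.len()).sum();
    let mut keys = Vec::with_capacity(total);
    for batch in batches {
        keys.extend_from_slice(&batch.keys);
    }
    Batch { keys }
}

/// Collects the build side into a single batch (hypothetical helper).
fn collect_build_side(mut batches: Vec<Batch>) -> Batch {
    if batches.len() == 1 {
        // Fast path: move the only batch out; no new buffer, no row copy.
        batches.pop().expect("length checked above")
    } else {
        // General path (zero or several batches): concatenate as before.
        concat_batches(&batches)
    }
}
```

In this sketch, calling `collect_build_side(vec![batch])` returns the original batch with its buffer untouched, while two or more batches still go through the copying path. Either way the result is one contiguous batch, so the hash map can be built from it without further changes.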
This pull request introduces a performance optimization to the `HashJoinExec` by adding a fast path for single-batch build sides.

💡 What: The optimization adds a condition to check if the build side of a hash join consists of a single `RecordBatch`. If so, it bypasses the expensive `concat_batches` operation and builds the hash map directly from that single batch.

🎯 Why: The `concat_batches` function is a known performance bottleneck as it involves allocating new memory and copying data. By avoiding this operation in the common single-batch case, we can significantly reduce memory allocation and CPU usage.

📊 Impact: This change improves the performance of hash joins where the build side is small enough to fit into a single batch.

🔬 Measurement: The performance improvement can be verified by running benchmarks that involve hash joins with single-batch build sides. The existing tests in `datafusion/physical-plan/src/joins/hash_join/exec.rs` cover this new path and continue to pass.

PR created automatically by Jules for task 6592070358591634336 started by @Dandandan