perf: optimize object store requests when reading CSV#22962
Conversation
| /// | ||
| /// `usize::MAX` is intentionally included: `ChunkedStore` treats it as | ||
| /// "one chunk containing everything", giving the single-chunk fast path. | ||
| pub const CHUNK_SIZES: &[usize] = &[1, 2, 3, 4, 5, 7, 8, 11, 13, 16, usize::MAX]; |
There was a problem hiding this comment.
I copied this from #20823, so this is duplicate code that's also in datasource-json - AlignedBoundaryStream was copied from there but the unit tests needed this helper which lives behind #[cfg(test)] in the JSON crate lol, so I just copied - if there's a better approach here to deduplicate please let me know and I'm happy to modify this!
There was a problem hiding this comment.
- can't we gate these under
#[cfg(test)]? - do these have to be
pub?
There was a problem hiding this comment.
could you also explain the original issue for why these are needed here?
There was a problem hiding this comment.
This is gated under #[cfg(test)] in mod.rs! But I modified this to pub(crate) :)
I put these here to test AlignedBoundaryStream itself, making sure the stream produces the correct raw bytes - the make_chunked_store in the JSON crate looks to be used for exercising the full JSON parsing pipeline end-to-end!
But happy to remove it you think it's redundant
There was a problem hiding this comment.
I see now, since AlignedBoundaryStream was moved, it no longer has access to make_chunked_store from datasource-json. The duplication is unfortunate but I don't know if there's a better solution.
There was a problem hiding this comment.
Haha yeah, I was banging my head against the wall to try and avoid the duplication but couldn't find a way either 😭
|
@ariel-miculas tagging you as the issue creator & author of #20823 which I based this on! @martin-g / @Weijun-H / @alamb - saw that you all helped with the review of the above JSON PR, so tagging you folks as well! Please let me know your thoughts - super excited for my first contribution to Datafusion! 😁 |
e09359b to
6c89916
Compare
|
Thanks, @saadtajwar , looks excellent! |
Thanks @ariel-miculas ! Hmm, I see the |
|
TPC-H CSV benchmark - main vs branch with
Note that I ran these locally on my machine! Every query is faster, with overall speedup of 1.7x |
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
|
Thanks so much @ariel-miculas ! Excited to get this one in! Anything needed on my end here? I see one of the CI checks ( |
You need an official approval, maybe @alamb could help you. |

Which issue does this PR close?
Rationale for this change
The CSV scanner currently uses
calculate_rangewhich issues two extraget_optsrequests per byte range to find newline boundaries (one for the start boundary, one for the end boundary), plus one GET for the actual data. For a file split into 3 partitions, this results in 8 total object store requests.#20823 solved this same problem for the JSON scanner by introducing
AlignedBoundaryStream, which wraps the raw byte stream and lazily aligns to newline boundaries as data is read, eliminating the extra boundary-seeking requests entirely. This PR applies the same approach to CSV.What changes are included in this PR?
Based on the approach from #20823:
Moved
AlignedBoundaryStreamfromdatasource-jsonto the shareddatasourcecrate so it can be reused by both JSON and CSV scanners. UpdatedCsvOpenerto use instead ofcalculate_range, and removed thecalculate_range&find_first_newlineas they no longer had any callers. Updated tests to reflect.Note that
RangeCalculationis left in place as it is a public API item, even though it no longer has any consumers.Are these changes tested?
Yes. The existing
AlignedBoundaryStreamunit tests (16 tests covering boundary alignment edge cases) were moved along with the implementation and continue to pass. Thequery_csv_file_with_byte_range_partitionssnapshot test inobject_store_access.rshas been updated to verify the new request pattern (4 requests instead of 8).Are there any user-facing changes?
No.