Skip to content

perf: optimize object store requests when reading CSV#22962

Open
saadtajwar wants to merge 7 commits into
apache:mainfrom
saadtajwar:saadt/optimize-object-store-accesses-csv
Open

perf: optimize object store requests when reading CSV#22962
saadtajwar wants to merge 7 commits into
apache:mainfrom
saadtajwar:saadt/optimize-object-store-accesses-csv

Conversation

@saadtajwar

@saadtajwar saadtajwar commented Jun 16, 2026

Copy link
Copy Markdown

Which issue does this PR close?

Rationale for this change

The CSV scanner currently uses calculate_range which issues two extra get_opts requests per byte range to find newline boundaries (one for the start boundary, one for the end boundary), plus one GET for the actual data. For a file split into 3 partitions, this results in 8 total object store requests.

#20823 solved this same problem for the JSON scanner by introducing AlignedBoundaryStream, which wraps the raw byte stream and lazily aligns to newline boundaries as data is read, eliminating the extra boundary-seeking requests entirely. This PR applies the same approach to CSV.

What changes are included in this PR?

Based on the approach from #20823:

Moved AlignedBoundaryStream from datasource-json to the shared datasource crate so it can be reused by both JSON and CSV scanners. Updated CsvOpener to use instead of calculate_range, and removed the calculate_range & find_first_newline as they no longer had any callers. Updated tests to reflect.

Note thatRangeCalculation is left in place as it is a public API item, even though it no longer has any consumers.

Are these changes tested?

Yes. The existing AlignedBoundaryStream unit tests (16 tests covering boundary alignment edge cases) were moved along with the implementation and continue to pass. The query_csv_file_with_byte_range_partitions snapshot test in object_store_access.rs has been updated to verify the new request pattern (4 requests instead of 8).

Are there any user-facing changes?

No.

@github-actions github-actions Bot added core Core DataFusion crate datasource Changes to the datasource crate labels Jun 16, 2026
Comment thread datafusion/datasource/src/test_util.rs Outdated
///
/// `usize::MAX` is intentionally included: `ChunkedStore` treats it as
/// "one chunk containing everything", giving the single-chunk fast path.
pub const CHUNK_SIZES: &[usize] = &[1, 2, 3, 4, 5, 7, 8, 11, 13, 16, usize::MAX];

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this from #20823, so this is duplicate code that's also in datasource-json - AlignedBoundaryStream was copied from there but the unit tests needed this helper which lives behind #[cfg(test)] in the JSON crate lol, so I just copied - if there's a better approach here to deduplicate please let me know and I'm happy to modify this!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. can't we gate these under #[cfg(test)]?
  2. do these have to be pub?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you also explain the original issue for why these are needed here?

@saadtajwar saadtajwar Jun 16, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is gated under #[cfg(test)] in mod.rs! But I modified this to pub(crate) :)

I put these here to test AlignedBoundaryStream itself, making sure the stream produces the correct raw bytes - the make_chunked_store in the JSON crate looks to be used for exercising the full JSON parsing pipeline end-to-end!

But happy to remove it you think it's redundant

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now, since AlignedBoundaryStream was moved, it no longer has access to make_chunked_store from datasource-json. The duplication is unfortunate but I don't know if there's a better solution.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha yeah, I was banging my head against the wall to try and avoid the duplication but couldn't find a way either 😭

@saadtajwar

saadtajwar commented Jun 16, 2026

Copy link
Copy Markdown
Author

@ariel-miculas tagging you as the issue creator & author of #20823 which I based this on!

@martin-g / @Weijun-H / @alamb - saw that you all helped with the review of the above JSON PR, so tagging you folks as well! Please let me know your thoughts - super excited for my first contribution to Datafusion! 😁

@saadtajwar saadtajwar force-pushed the saadt/optimize-object-store-accesses-csv branch from e09359b to 6c89916 Compare June 16, 2026 01:04
Comment thread datafusion/datasource/src/boundary_stream.rs Outdated
@ariel-miculas

Copy link
Copy Markdown
Contributor

Thanks, @saadtajwar , looks excellent!
Are there any benchmarks that you could run, like clickbench with SIMULATE_LATENCY as I did here? #21478 (comment)
Haven't looked if there's CSV support, JSON support was missing when I ran these tests (#21446)

@saadtajwar

Copy link
Copy Markdown
Author

Thanks, @saadtajwar , looks excellent! Are there any benchmarks that you could run, like clickbench with SIMULATE_LATENCY as I did here? #21478 (comment) Haven't looked if there's CSV support, JSON support was missing when I ran these tests (#21446)

Thanks @ariel-miculas ! Hmm, I see the tpch_csv benchmark which looks like it would exercise the byte-range repartioning path - I'll run that now with SIMULATE_LATENCY and report back the results! The Clickbench benchmarks don't seem to support CSV from what I can see

@saadtajwar

saadtajwar commented Jun 16, 2026

Copy link
Copy Markdown
Author

TPC-H CSV benchmark - main vs branch with SIMULATE_LATENCY: @ariel-miculas

Query main (ms) PR (ms) Change Speedup
Q01 552.7 372.1 -32.7% 1.49x
Q02 2614.8 1535.2 -41.3% 1.70x
Q03 1445.4 829.6 -42.6% 1.74x
Q04 996.4 582.2 -41.6% 1.71x
Q05 2004.6 1209.2 -39.7% 1.66x
Q06 543.9 357.3 -34.3% 1.52x
Q07 2051.2 1226.6 -40.2% 1.67x
Q08 2598.6 1544.3 -40.6% 1.68x
Q09 2453.5 1363.7 -44.4% 1.80x
Q10 1563.5 924.0 -40.9% 1.69x
Q11 2102.4 1279.4 -39.1% 1.64x
Q12 1027.2 595.3 -42.0% 1.73x
Q13 933.9 522.8 -44.0% 1.79x
Q14 1003.5 575.5 -42.7% 1.74x
Q15 1563.2 946.1 -39.5% 1.65x
Q16 1387.0 774.6 -44.2% 1.79x
Q17 1526.5 878.8 -42.4% 1.74x
Q18 1889.0 1148.7 -39.2% 1.64x
Q19 1004.2 584.5 -41.8% 1.72x
Q20 1900.8 1092.7 -42.5% 1.74x
Q21 2508.2 1478.9 -41.0% 1.70x
Q22 1452.8 827.7 -43.0% 1.76x
Total 35,123 20,649 -41.2% 1.70x

Note that I ran these locally on my machine! Every query is faster, with overall speedup of 1.7x

@github-actions

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v54.0.0 (current)
       Built [  81.814s] (current)
     Parsing datafusion v54.0.0 (current)
      Parsed [   0.029s] (current)
    Building datafusion v54.0.0 (baseline)
       Built [  81.686s] (baseline)
     Parsing datafusion v54.0.0 (baseline)
      Parsed [   0.030s] (baseline)
    Checking datafusion v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.721s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 167.144s] datafusion
    Building datafusion-datasource v54.0.0 (current)
       Built [  29.238s] (current)
     Parsing datafusion-datasource v54.0.0 (current)
      Parsed [   0.027s] (current)
    Building datafusion-datasource v54.0.0 (baseline)
       Built [  29.009s] (baseline)
     Parsing datafusion-datasource v54.0.0 (baseline)
      Parsed [   0.027s] (baseline)
    Checking datafusion-datasource v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.281s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure function_missing: pub fn removed or renamed ---

Description:
A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_missing.ron

Failed in:
  function datafusion_datasource::calculate_range, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/c92f21c79373eb786fe8d4c9a8bf4cc504222670/datafusion/datasource/src/mod.rs:442

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  59.507s] datafusion-datasource
    Building datafusion-datasource-csv v54.0.0 (current)
       Built [  30.342s] (current)
     Parsing datafusion-datasource-csv v54.0.0 (current)
      Parsed [   0.010s] (current)
    Building datafusion-datasource-csv v54.0.0 (baseline)
       Built [  29.482s] (baseline)
     Parsing datafusion-datasource-csv v54.0.0 (baseline)
      Parsed [   0.010s] (baseline)
    Checking datafusion-datasource-csv v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.119s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  60.824s] datafusion-datasource-csv
    Building datafusion-datasource-json v54.0.0 (current)
       Built [  29.358s] (current)
     Parsing datafusion-datasource-json v54.0.0 (current)
      Parsed [   0.010s] (current)
    Building datafusion-datasource-json v54.0.0 (baseline)
       Built [  29.727s] (baseline)
     Parsing datafusion-datasource-json v54.0.0 (baseline)
      Parsed [   0.011s] (baseline)
    Checking datafusion-datasource-json v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.098s] 223 checks: 220 pass, 3 fail, 0 warn, 30 skip

--- failure module_missing: pub module removed or renamed ---

Description:
A publicly-visible module cannot be imported by its prior path. A `pub use` may have been removed, or the module may have been renamed, removed, or made non-public.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/module_missing.ron

Failed in:
  mod datafusion_datasource_json::boundary_stream, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/c92f21c79373eb786fe8d4c9a8bf4cc504222670/datafusion/datasource-json/src/boundary_stream.rs:18

--- failure pub_module_level_const_missing: pub module-level const is missing ---

Description:
A public const is missing or renamed
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/pub_module_level_const_missing.ron

Failed in:
  END_SCAN_LOOKAHEAD in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/c92f21c79373eb786fe8d4c9a8bf4cc504222670/datafusion/datasource-json/src/boundary_stream.rs:36

--- failure struct_missing: pub struct removed or renamed ---

Description:
A publicly-visible struct cannot be imported by its prior path. A `pub use` may have been removed, or the struct itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/struct_missing.ron

Failed in:
  struct datafusion_datasource_json::boundary_stream::AlignedBoundaryStream, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/c92f21c79373eb786fe8d4c9a8bf4cc504222670/datafusion/datasource-json/src/boundary_stream.rs:64

     Summary semver requires new major version: 3 major and 0 minor checks failed
    Finished [  60.108s] datafusion-datasource-json

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 17, 2026
@saadtajwar

Copy link
Copy Markdown
Author

Thanks so much @ariel-miculas ! Excited to get this one in!

Anything needed on my end here? I see one of the CI checks (Rust / cargo examples (amd64) (pull_request)) failing, but this seems to be caused by a (hopefully ephemeral?) error? I can push an empty commit to re-run if needed!

Run # test datafusion-sql examples
    Updating crates.io index
error: failed to get `windows` as a dependency of package `sysinfo v0.39.3`
    ... which satisfies dependency `sysinfo = "^0.39.3"` (locked to 0.39.3) of package `datafusion v54.0.0 (/__w/datafusion/datafusion/datafusion/core)`

Caused by:
  failed to load source for dependency `windows`

Caused by:
  unable to update registry `crates-io`

Caused by:
  download of wi/nd/windows failed

Caused by:
  curl failed

Caused by:
  [16] Error in the HTTP2 framing layer
Error: Process completed with exit code 101.

@ariel-miculas

Copy link
Copy Markdown
Contributor

Anything needed on my end here?

You need an official approval, maybe @alamb could help you.

@saadtajwar

Copy link
Copy Markdown
Author

Anything needed on my end here?

You need an official approval, maybe @alamb could help you.

@alamb would love a review here whenever you have a chance! Thanks in advance for any feedback! Super excited about contributing to Datafusion! 😁

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change core Core DataFusion crate datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize object store accesses for the CSV scanner

2 participants