Skip to content

Parallelize graph writes#542

Merged
tlwillke merged 19 commits intomainfrom
parallelize_graph_writes
Nov 26, 2025
Merged

Parallelize graph writes#542
tlwillke merged 19 commits intomainfrom
parallelize_graph_writes

Conversation

@MarkWolters
Copy link
Copy Markdown
Contributor

@MarkWolters MarkWolters commented Oct 15, 2025

This PR adds the option to specify that graph indexes should be written in parallel rather than sequentially. By default the existing sequential write behavior is maintained, parallel writes will only be used when the withParallelWrites(true) option is set through the OnDiskGraphIndexWriter.Builder class. Testing results below show the speedup achieved in the write phase across a number of cores. These gains appear to scale linearly with respect to dataset size (ie writing a dataset of 10 million records will take about 10x as long as a dataset of 1 million records but the speedup in parallel v sequential is roughly equal).

ETA: Testing also showed that the performance gains using a "prod-like" i3.4xlarge w/16 vCPUs and 8 disks striped RAID0 were roughly equivalent to the performance gains using a 64 vCPU m5.16xlarge with standard SSD

sequential parallel speedup

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Oct 15, 2025

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@MarkWolters MarkWolters requested a review from Copilot October 21, 2025 11:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces parallel write capability for graph indexes to improve write throughput. The implementation maintains backward compatibility by defaulting to sequential writes unless explicitly enabled via withParallelWrites(true). Testing shows linear performance scaling with dataset size, with speedups consistent across different hardware configurations.

Key changes:

  • Added ParallelGraphWriter class that orchestrates parallel L0 record building using thread pools and asynchronous file I/O
  • Introduced ByteBufferIndexWriter for in-memory record construction before bulk disk writes
  • Added withParallelWrites(boolean) builder option to OnDiskGraphIndexWriter.Builder

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
ParallelWriteExample.java Example demonstrating parallel vs sequential write usage patterns and benchmark comparison
Grid.java Enables parallel writes in the production Grid.buildOnDisk method
ParallelGraphWriter.java Core parallel writer implementation with thread pooling, memory-aware backpressure, and async I/O
OnDiskGraphIndexWriter.java Updated to support optional parallel write mode via builder configuration
NodeRecordTask.java Task implementation for building individual node records in parallel worker threads
GraphIndexWriterTypes.java New enum defining available writer types (sequential vs parallel)
GraphIndexWriter.java Added factory methods for creating appropriate writer builders based on type
ByteBufferIndexWriter.java IndexWriter implementation for writing to ByteBuffers in memory

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@sam-herman
Copy link
Copy Markdown
Contributor

sam-herman commented Oct 21, 2025

@MarkWolters this is very cool change!

re:

sequential writes will only be used when the withParallelWrites(true) option is set through the OnDiskGraphIndexWriter.Builder class

Did you mean parallel writes will only be used when the withParallelWrites(true) option is set?

Copy link
Copy Markdown
Contributor

@sam-herman sam-herman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more comments, primarily around two aspects:

  1. Calculation of buffer - Seems like we need to consider alignment and hardware parallelism to reduce buffer sizes.
  2. Batching and GC and malloc overhead - It seems as if we are creating a task per ordinal, which seems like something that could be somewhat expensive on memory allocation and GC overhead.

Copy link
Copy Markdown
Contributor

@marianotepper marianotepper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread benchmarks-jmh/scripts/test_node_setup.sh
Copy link
Copy Markdown
Collaborator

@tlwillke tlwillke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address the comments. Other LGTM.

Copy link
Copy Markdown
Collaborator

@tlwillke tlwillke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for resolving all of the comments.

@tlwillke tlwillke merged commit 019a241 into main Nov 26, 2025
12 checks passed
@tlwillke tlwillke deleted the parallelize_graph_writes branch November 26, 2025 00:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants