Conversation
|
Before you submit for review:
If you did not complete any of these, then please explain below. |
There was a problem hiding this comment.
Pull Request Overview
This PR introduces parallel write capability for graph indexes to improve write throughput. The implementation maintains backward compatibility by defaulting to sequential writes unless explicitly enabled via withParallelWrites(true). Testing shows linear performance scaling with dataset size, with speedups consistent across different hardware configurations.
Key changes:
- Added
ParallelGraphWriterclass that orchestrates parallel L0 record building using thread pools and asynchronous file I/O - Introduced
ByteBufferIndexWriterfor in-memory record construction before bulk disk writes - Added
withParallelWrites(boolean)builder option toOnDiskGraphIndexWriter.Builder
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| ParallelWriteExample.java | Example demonstrating parallel vs sequential write usage patterns and benchmark comparison |
| Grid.java | Enables parallel writes in the production Grid.buildOnDisk method |
| ParallelGraphWriter.java | Core parallel writer implementation with thread pooling, memory-aware backpressure, and async I/O |
| OnDiskGraphIndexWriter.java | Updated to support optional parallel write mode via builder configuration |
| NodeRecordTask.java | Task implementation for building individual node records in parallel worker threads |
| GraphIndexWriterTypes.java | New enum defining available writer types (sequential vs parallel) |
| GraphIndexWriter.java | Added factory methods for creating appropriate writer builders based on type |
| ByteBufferIndexWriter.java | IndexWriter implementation for writing to ByteBuffers in memory |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
@MarkWolters this is very cool change! re:
Did you mean parallel writes will only be used when the withParallelWrites(true) option is set? |
There was a problem hiding this comment.
Added more comments, primarily around two aspects:
- Calculation of buffer - Seems like we need to consider alignment and hardware parallelism to reduce buffer sizes.
- Batching and GC and malloc overhead - It seems as if we are creating a task per ordinal, which seems like something that could be somewhat expensive on memory allocation and GC overhead.
tlwillke
left a comment
There was a problem hiding this comment.
Please address the comments. Other LGTM.
tlwillke
left a comment
There was a problem hiding this comment.
Thanks for resolving all of the comments.
This PR adds the option to specify that graph indexes should be written in parallel rather than sequentially. By default the existing sequential write behavior is maintained, parallel writes will only be used when the withParallelWrites(true) option is set through the OnDiskGraphIndexWriter.Builder class. Testing results below show the speedup achieved in the write phase across a number of cores. These gains appear to scale linearly with respect to dataset size (ie writing a dataset of 10 million records will take about 10x as long as a dataset of 1 million records but the speedup in parallel v sequential is roughly equal).
ETA: Testing also showed that the performance gains using a "prod-like" i3.4xlarge w/16 vCPUs and 8 disks striped RAID0 were roughly equivalent to the performance gains using a 64 vCPU m5.16xlarge with standard SSD