feat(engine): add hash-based sparse weight sync for FSDP and Archon #1127
Closed
Conversation
Contributor
Code Review
This pull request introduces hash-based sparse weight updates to the FSDP and Archon engines, allowing unchanged parameters to be skipped during broadcasts to reduce communication overhead. The implementation adds state management for parameter hashes, two-pass change-detection logic, and rank-consistency verification. Review feedback highlights high-severity performance issues from synchronous .item() calls on CUDA tensors inside loops, and a bug in the Archon engine's verification logic caused by non-deterministic Python hashing. It also recommends passing materialized parameter lists to the hash-computation functions to avoid redundant iteration.
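To illustrate the `.item()`-in-a-loop concern: each `.item()` on a CUDA tensor blocks until the device catches up, so hashing N parameters this way costs N synchronizations. A hedged sketch of the fix, batching all hashes into one transfer. The `shard_hash` checksum here is a stand-in assumption, not the PR's actual hash helper:

```python
import torch

FNV_PRIME = 1099511628211  # fits in a signed int64

def shard_hash(t: torch.Tensor) -> torch.Tensor:
    """Return a 0-dim int64 checksum of the tensor's bytes, computed on-device
    (illustrative position-weighted sum, not the PR's real hash)."""
    b = t.detach().contiguous().flatten().view(torch.uint8).to(torch.int64)
    idx = torch.arange(b.numel(), dtype=torch.int64, device=b.device)
    return ((b + 1) * (idx * FNV_PRIME + 1)).sum()

def hashes_slow(params):
    # Anti-pattern flagged in review: .item() inside the loop forces one
    # blocking device-to-host sync per parameter on CUDA.
    return [shard_hash(p).item() for p in params]

def hashes_fast(params):
    # Keep every hash on-device, then do a single transfer at the end.
    return torch.stack([shard_hash(p) for p in params]).tolist()
```

Both functions return the same values; on CPU tensors the difference is negligible, but on CUDA the batched version replaces N syncs with one.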
Reduce distributed weight sync traffic by hashing parameter shards and only broadcasting tensors that changed since the last successful update. Keep the feature opt-in and use signed int64 hashes so NCCL collectives can exchange hash values safely.

Key changes:
- add reusable tensor hash helpers and regression tests
- add sparse weight sync support to the FSDP and Archon engines
- expose sparse_weight_sync in TrainEngineConfig and regenerate CLI docs
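The change-detection side of this scheme can be sketched as follows. The hash function and cache layout are illustrative assumptions, and the actual broadcast (pass 2) is only indicated in a comment:

```python
import torch

def _shard_hash(t: torch.Tensor) -> torch.Tensor:
    # Stand-in deterministic checksum over the tensor's raw bytes (assumption;
    # the PR's reusable hash helpers live in their own module).
    b = t.detach().contiguous().flatten().view(torch.uint8).to(torch.int64)
    idx = torch.arange(b.numel(), dtype=torch.int64, device=b.device)
    return ((b + 1) * (idx * 1099511628211 + 1)).sum()

def plan_sparse_sync(named_params, hash_cache):
    """Pass 1 of the two-pass scheme: hash every shard, compare against the
    hashes recorded after the last successful update, and return the names
    that still need broadcasting. Pass 2 (not shown) would dist.broadcast()
    exactly those tensors and keep the refreshed cache only on success."""
    names = [n for n, _ in named_params]
    # Hash on-device, then one host transfer for all values.
    hashes = torch.stack([_shard_hash(p) for _, p in named_params]).tolist()
    changed = [n for n, h in zip(names, hashes) if hash_cache.get(n) != h]
    new_cache = dict(zip(names, hashes))
    return changed, new_cache
```

With an empty cache every shard is reported as changed (first sync is a full broadcast); after a sync with no parameter updates, `changed` comes back empty.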
Description
Add hash-based sparse weight synchronization so FSDP and Archon can skip broadcasting unchanged parameter shards during distributed weight updates. This also switches exchanged hash tensors to signed int64 so NCCL collectives can move hash values safely, while keeping the feature opt-in through
sparse_weight_sync=False by default.

Related Issue
Refs #1125
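One detail behind the signed-int64 choice: PyTorch has no uint64 dtype, so a 64-bit unsigned hash must be reinterpreted in two's complement before it can travel in a torch.int64 tensor through an NCCL collective. A minimal sketch of that reinterpretation (helper names are hypothetical):

```python
def as_signed_int64(h: int) -> int:
    """Reinterpret an unsigned 64-bit hash as a signed int64 value
    (two's-complement wrap), so it fits in a torch.int64 tensor."""
    h &= (1 << 64) - 1          # keep only the low 64 bits
    return h - (1 << 64) if h >= (1 << 63) else h

def as_unsigned_u64(h: int) -> int:
    """Inverse mapping: recover the original unsigned 64-bit value."""
    return h & ((1 << 64) - 1)
```

The mapping is lossless in both directions, so ranks can compare signed values directly without ever converting back.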
Type of Change
Checklist
- pre-commit run --all-files
- ./docs/build_all.sh
- main
- /review-pr command
- /create-pr

Breaking Change Details (if applicable):
N/A
Additional Context
- Exposes sparse_weight_sync in TrainEngineConfig and regenerates the English and Chinese CLI reference docs.
- Ran uv run pytest tests/test_tensor_hash.py and pre-commit run --all-files (with the repo virtualenv on PATH).
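On the review's non-deterministic-hashing point: Python's built-in hash() for strings is salted per process (PYTHONHASHSEED), so comparing hash(...) values across ranks can spuriously fail even when the inputs match on every rank. A sketch of a deterministic alternative for the rank-consistency check (the function name is an assumption):

```python
import hashlib

def order_digest(param_names) -> int:
    """Deterministic 64-bit digest of the parameter-name order. Unlike the
    built-in hash(), hashlib output does not depend on the per-process
    hash salt, so equal inputs give equal digests on every rank."""
    m = hashlib.sha256()
    for name in param_names:
        m.update(name.encode("utf-8"))
        m.update(b"\x00")  # separator so ["ab"] != ["a", "b"]
    # Truncate to 8 bytes, signed, so the value fits in a torch.int64 tensor.
    return int.from_bytes(m.digest()[:8], "big", signed=True)
```

Each rank can then all-gather its digest and assert that all values agree before trusting the sparse update plan.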