feat(engine): add hash-based sparse weight sync for FSDP and Archon #1127

Closed

rchardx wants to merge 1 commit into main from rchardx/grad
Conversation

@rchardx (Collaborator) commented Apr 1, 2026

Description

Add hash-based sparse weight synchronization so FSDP and Archon can skip broadcasting unchanged parameter shards during distributed weight updates. This also switches exchanged hash tensors to signed int64 so NCCL collectives can move hash values safely, while keeping the feature opt-in through sparse_weight_sync=False by default.
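The signed-int64 detail matters because NCCL collectives have no unsigned 64-bit integer type, so an exchanged shard hash must fit in a signed int64. A minimal sketch of what such a hash helper might look like (the name `tensor_bytes_hash_i64` is hypothetical, not the PR's actual API):

```python
import hashlib
import struct

def tensor_bytes_hash_i64(data: bytes) -> int:
    # Hypothetical helper: hash raw tensor bytes, then fold the digest
    # into a *signed* 64-bit integer so it can travel through NCCL
    # collectives that operate on int64 tensors.
    digest = hashlib.blake2b(data, digest_size=8).digest()
    # '<q' interprets the 8 bytes as a signed little-endian int64, so the
    # value lands in [-2**63, 2**63 - 1] instead of overflowing int64.
    (h,) = struct.unpack("<q", digest)
    return h
```

Because the digest is deterministic across processes (unlike Python's built-in `hash()` on strings, which is salted per interpreter), all ranks hashing identical bytes agree on the result.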

Related Issue

Refs #1125

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

  • Introduces reusable tensor-hash helpers plus regression tests for signed wraparound behavior.
  • Adds sparse weight sync to both FSDP and Archon paths so only changed parameters are broadcast after synchronized hash comparison.
  • Exposes sparse_weight_sync in TrainEngineConfig and regenerates the English and Chinese CLI reference docs.
  • Validation run locally: uv run pytest tests/test_tensor_hash.py and pre-commit run --all-files (with the repo virtualenv on PATH).
  • GPU and distributed integration suites were not run in this environment.
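The skip-unchanged-shards idea can be sketched in plain Python, with no real collectives: compare each parameter's hash against the value cached from the last successful update, and only the parameters whose hashes changed would be broadcast. All names here (`SparseWeightSync`, `hash_i64`, `plan_update`) are hypothetical, not the PR's actual API:

```python
import hashlib
import struct

def hash_i64(data: bytes) -> int:
    # Deterministic signed-int64 hash of raw bytes (stand-in for a
    # GPU-side tensor hash that NCCL could exchange as int64).
    (h,) = struct.unpack("<q", hashlib.blake2b(data, digest_size=8).digest())
    return h

class SparseWeightSync:
    """Toy model of sparse weight sync: diff per-parameter hashes
    against those recorded at the last successful update."""

    def __init__(self):
        self._last_hashes = {}  # parameter name -> signed int64 hash

    def plan_update(self, params):
        # Hash every parameter; collect only those whose hash differs
        # from the cached value. A real engine would then broadcast
        # just these tensors instead of the full parameter set.
        changed = []
        for name, blob in params.items():
            h = hash_i64(blob)
            if self._last_hashes.get(name) != h:
                changed.append(name)
                self._last_hashes[name] = h
        return changed
```

On the first update every parameter is "changed" (the cache is empty), a repeat update with identical weights broadcasts nothing, and after an optimizer step only the touched shards are sent.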

@gemini-code-assist (bot, Contributor) left a comment


Code Review

This pull request introduces hash-based sparse weight updates to the FSDP and Archon engines, skipping unchanged parameters during broadcasts to reduce communication overhead. The implementation adds state management for parameter hashes, two-pass change-detection logic, and rank-consistency verification. Review feedback flags high-severity performance issues from synchronous .item() calls on CUDA tensors inside loops, and a bug in the Archon engine's verification logic caused by non-deterministic Python hashing. It also recommends passing materialized parameter lists to the hash-computation functions to avoid redundant iteration over the parameters.
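The `.item()` concern can be made concrete with a toy model of device synchronization (all names here are hypothetical; in real PyTorch code the fix would pack all hashes into a single CUDA int64 tensor and move it to the host in one transfer rather than calling `.item()` per parameter):

```python
class FakeDevice:
    """Stand-in for a CUDA device: counts host-device synchronizations
    to show why per-tensor .item() calls in a loop are costly."""

    def __init__(self):
        self.syncs = 0

    def item(self, value):
        # Models tensor.item(): each call forces one device sync.
        self.syncs += 1
        return value

    def tolist(self, values):
        # Models one batched device-to-host transfer (single sync).
        self.syncs += 1
        return list(values)

def changed_names_per_tensor(dev, hashes, last):
    # Anti-pattern flagged by the review: one sync per parameter.
    return [n for n, h in hashes.items() if dev.item(h) != last.get(n)]

def changed_names_batched(dev, hashes, last):
    # Pack all hashes, cross the device boundary once, compare on host.
    names = list(hashes)
    host = dev.tolist(hashes[n] for n in names)
    return [n for n, h in zip(names, host) if h != last.get(n)]
```

With N parameters, the first variant pays N synchronizations while the second pays exactly one, which is the substance of the review comment.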

  • Comment thread: areal/engine/fsdp_engine.py (outdated)
  • Comment thread: areal/experimental/engine/archon_weight_sync.py (outdated)
  • Comment thread: areal/experimental/engine/archon_weight_sync.py (outdated)
  • Comment thread: areal/experimental/engine/archon_weight_sync.py (outdated)
@rchardx changed the title from "feat(engine): hash-based sparse weight update for xccl path" to "feat(engine): add hash-based sparse weight updates" on Apr 1, 2026
@rchardx force-pushed the rchardx/grad branch 2 times, most recently from d8a74de to ae9cb33, on Apr 2, 2026 04:35
@rchardx changed the title from "feat(engine): add hash-based sparse weight updates" to "feat(engine): add hash-based sparse weight sync with configurable toggle" on Apr 2, 2026
@rchardx changed the title from "feat(engine): add hash-based sparse weight sync with configurable toggle" to "feat(engine): add hash-based sparse weight sync" on Apr 2, 2026
Reduce distributed weight sync traffic by hashing parameter shards and
only broadcasting tensors that changed since the last successful
update. Keep the feature opt-in and use signed int64 hashes so NCCL
collectives can exchange hash values safely.

Key changes:
- add reusable tensor hash helpers and regression tests
- add sparse weight sync support to FSDP and Archon engines
- expose sparse_weight_sync in TrainEngineConfig and regenerate CLI docs
@rchardx changed the title from "feat(engine): add hash-based sparse weight sync" to "feat(engine): add hash-based sparse weight sync for FSDP and Archon" on Apr 2, 2026
@rchardx rchardx closed this Apr 10, 2026
@rchardx rchardx deleted the rchardx/grad branch April 10, 2026 09:15