feat(engine): add hash-based sparse weight sync for FSDP and Archon #1127
Closed
Conversation
Contributor
Code Review
This pull request introduces hash-based sparse weight updates to the FSDP and Archon engines, allowing unchanged parameters to be skipped during broadcasts to reduce communication overhead. The implementation adds state management for parameter hashes, two-pass change-detection logic, and rank-consistency verification. Review feedback highlights high-severity performance issues from synchronous .item() calls on CUDA tensors inside loops, and a bug in the Archon engine's verification logic caused by non-deterministic Python hashing. It also recommends passing materialized parameter lists to the hash-computation functions to avoid redundant iteration.
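To illustrate the `.item()`-in-a-loop concern: each `.item()` on a CUDA tensor blocks until the device catches up, so hashing N parameters this way costs N synchronizations. A hedged sketch of the fix, batching all hashes into one transfer. The `shard_hash` checksum here is a stand-in assumption, not the PR's actual hash helper:

```python
import torch

FNV_PRIME = 1099511628211  # fits in a signed int64

def shard_hash(t: torch.Tensor) -> torch.Tensor:
    """Return a 0-dim int64 checksum of the tensor's bytes, computed on-device
    (illustrative position-weighted sum, not the PR's real hash)."""
    b = t.detach().contiguous().flatten().view(torch.uint8).to(torch.int64)
    idx = torch.arange(b.numel(), dtype=torch.int64, device=b.device)
    return ((b + 1) * (idx * FNV_PRIME + 1)).sum()

def hashes_slow(params):
    # Anti-pattern flagged in review: .item() inside the loop forces one
    # blocking device-to-host sync per parameter on CUDA.
    return [shard_hash(p).item() for p in params]

def hashes_fast(params):
    # Keep every hash on-device, then do a single transfer at the end.
    return torch.stack([shard_hash(p) for p in params]).tolist()
```

Both functions return the same values; on CPU tensors the difference is negligible, but on CUDA the batched version replaces N syncs with one.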
Reduce distributed weight sync traffic by hashing parameter shards and only broadcasting tensors that changed since the last successful update. Keep the feature opt-in and use signed int64 hashes so NCCL collectives can exchange hash values safely.

Key changes:
- add reusable tensor hash helpers and regression tests
- add sparse weight sync support to the FSDP and Archon engines
- expose sparse_weight_sync in TrainEngineConfig and regenerate CLI docs
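The change-detection side of this scheme can be sketched as follows. The hash function and cache layout are illustrative assumptions, and the actual broadcast (pass 2) is only indicated in a comment:

```python
import torch

def _shard_hash(t: torch.Tensor) -> torch.Tensor:
    # Stand-in deterministic checksum over the tensor's raw bytes (assumption;
    # the PR's reusable hash helpers live in their own module).
    b = t.detach().contiguous().flatten().view(torch.uint8).to(torch.int64)
    idx = torch.arange(b.numel(), dtype=torch.int64, device=b.device)
    return ((b + 1) * (idx * 1099511628211 + 1)).sum()

def plan_sparse_sync(named_params, hash_cache):
    """Pass 1 of the two-pass scheme: hash every shard, compare against the
    hashes recorded after the last successful update, and return the names
    that still need broadcasting. Pass 2 (not shown) would dist.broadcast()
    exactly those tensors and keep the refreshed cache only on success."""
    names = [n for n, _ in named_params]
    # Hash on-device, then one host transfer for all values.
    hashes = torch.stack([_shard_hash(p) for _, p in named_params]).tolist()
    changed = [n for n, h in zip(names, hashes) if hash_cache.get(n) != h]
    new_cache = dict(zip(names, hashes))
    return changed, new_cache
```

With an empty cache every shard is reported as changed (first sync is a full broadcast); after a sync with no parameter updates, `changed` comes back empty.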
Description
Add hash-based sparse weight synchronization so FSDP and Archon can skip broadcasting unchanged parameter shards during distributed weight updates. This also switches exchanged hash tensors to signed int64 so NCCL collectives can move hash values safely, while keeping the feature opt-in through
sparse_weight_sync=False by default.

Related Issue
Refs #1125
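One detail behind the signed-int64 choice: PyTorch has no uint64 dtype, so a 64-bit unsigned hash must be reinterpreted in two's complement before it can travel in a torch.int64 tensor through an NCCL collective. A minimal sketch of that reinterpretation (helper names are hypothetical):

```python
def as_signed_int64(h: int) -> int:
    """Reinterpret an unsigned 64-bit hash as a signed int64 value
    (two's-complement wrap), so it fits in a torch.int64 tensor."""
    h &= (1 << 64) - 1          # keep only the low 64 bits
    return h - (1 << 64) if h >= (1 << 63) else h

def as_unsigned_u64(h: int) -> int:
    """Inverse mapping: recover the original unsigned 64-bit value."""
    return h & ((1 << 64) - 1)
```

The mapping is lossless in both directions, so ranks can compare signed values directly without ever converting back.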
Type of Change
Checklist
- pre-commit run --all-files
- ./docs/build_all.sh
- main
- /review-pr command
- /create-pr

Breaking Change Details (if applicable):
N/A
Additional Context
- Exposes sparse_weight_sync in TrainEngineConfig and regenerates the English and Chinese CLI reference docs.
- Ran uv run pytest tests/test_tensor_hash.py and pre-commit run --all-files (with the repo virtualenv on PATH).
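On the review's non-deterministic-hashing point: Python's built-in hash() for strings is salted per process (PYTHONHASHSEED), so comparing hash(...) values across ranks can spuriously fail even when the inputs match on every rank. A sketch of a deterministic alternative for the rank-consistency check (the function name is an assumption):

```python
import hashlib

def order_digest(param_names) -> int:
    """Deterministic 64-bit digest of the parameter-name order. Unlike the
    built-in hash(), hashlib output does not depend on the per-process
    hash salt, so equal inputs give equal digests on every rank."""
    m = hashlib.sha256()
    for name in param_names:
        m.update(name.encode("utf-8"))
        m.update(b"\x00")  # separator so ["ab"] != ["a", "b"]
    # Truncate to 8 bytes, signed, so the value fits in a torch.int64 tensor.
    return int.from_bytes(m.digest()[:8], "big", signed=True)
```

Each rank can then all-gather its digest and assert that all values agree before trusting the sparse update plan.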