FP8 params support for megatron-fsdp (MXFP8/Blockwise) #2239
Conversation
Is FP8 activations/grad support on Hopper FSDP with blockwise support on the roadmap as well?
See comment on Dev PR: #2086 (comment). Can we add a few simple unit tests?
Review comment on the diff:

```python
    return TE_VERSION > PkgVersion(vers)


def is_float8tensor(tensor: torch.Tensor) -> bool:
```

Suggested change:

```diff
-def is_float8tensor(tensor: torch.Tensor) -> bool:
+def is_float8tensor(tensor: torch.Tensor) -> TypeGuard[FP8_TENSOR_CLASS]:
```
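For context, a minimal sketch of the TypeGuard pattern suggested above. The import path, and the assumption that FP8_TENSOR_CLASS is a tuple of Transformer Engine FP8 tensor classes, are illustrative rather than taken from this PR; the sketch narrows to a single Float8Tensor class for simplicity:

```python
from typing import TypeGuard  # Python 3.10+; use typing_extensions on older versions

import torch
# Assumption: the FP8 tensor class lives in Transformer Engine; the exact
# module path varies across TE versions, so treat this import as illustrative.
from transformer_engine.pytorch.tensor.float8_tensor import Float8Tensor

# Hypothetical constant: the tuple of FP8 tensor classes the helper checks against.
FP8_TENSOR_CLASS = (Float8Tensor,)


def is_float8tensor(tensor: torch.Tensor) -> TypeGuard[Float8Tensor]:
    """Return True if `tensor` is a Transformer Engine FP8 tensor.

    The TypeGuard return annotation lets static type checkers narrow
    `tensor` to the FP8 tensor type in branches where this returns True.
    """
    return isinstance(tensor, FP8_TENSOR_CLASS)
```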
Hi @Skylion007, do you mean FP8 activations + param support on Hopper? It has already been merged into the dev branch (PR #2086), and we'll look into merging it into the main branch when we have time.
/ok to test 42a6bdc
Signed-off-by: kunlunl <[email protected]> Co-authored-by: jianbinc <[email protected]>
Force-pushed from 42a6bdc to feb6753.
/ok to test feb6753
Review comment on the diff:

```python
# to allocate as little memory as possible for this forward pass.
param_list = list(module.parameters(recurse=False))

if self.enable_fine_grained_param_gather_hook:
```

Same here.
Review comment on the diff:

```python
else:
    param_list = list(module.parameters(recurse=False))

if self.enable_fine_grained_param_gather_hook:
```

Same here.
Signed-off-by: jianbinc <[email protected]>
handle the case where fp8_tensor._data is None
/ok to test 31624f7
Finally found time to prototype this backend implementation in fully_shard; I'm generally happy with this PR. I'll submit a follow-up PR directly to main that exposes FP8 parameter support to fully_shard, along with a unit test for it.
@shjwudp @kunlunl I do have a comment beyond the scope of this PR, though, pertaining to this fp8_model_init: https://github.com/NVIDIA/Megatron-LM/blob/dev/megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py#L3738
We should move this to mcore_fsdp_adapter.py during MegatronFSDP.__init__ so that both Megatron-LM and native Torch can use the same initialization pattern:
```python
# Construct toy model.
with te.pytorch.quantized_model_init(
    enabled=True,
    recipe=fp8_recipe,
    # Needed for FP8 parameters with Megatron-FSDP.
    preserve_high_precision_init_val=True,
):
    toy_model = ToyTETransformer(
        model_dim=DIM_SIZE,
        num_heads=2,
        num_layers=NUM_LAYERS,
        output_dim=DIM_SIZE,
        device="meta",
    )

# Fully-shard the model.
# NOTE: We do NOT need the quantized_model_init context manager for Megatron-FSDP,
# because it has already set up the correct state during the root module FP8 init, I believe?
mfsdp_model = fully_shard_model(
    module=toy_model,
    fsdp_unit_modules=[te.pytorch.TransformerLayer, te.pytorch.Linear],
    zero_dp_strategy=3,
    init_model_with_meta_device=True,
)
```
This should not break Megatron-LM, right? (Testing...) I believe this also means we do not need an fp8_param_gather argument either!
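Purely as an illustration of the proposal above, and not the actual mcore_fsdp_adapter.py code, here is a rough, hypothetical sketch of owning the quantized-init context inside the wrapper's __init__ instead of deep inside param_and_grad_buffer.py. The class name and the `_build_param_and_grad_buffers` placeholder are invented for this sketch; only the `te.pytorch.quantized_model_init(...)` call mirrors the snippet above:

```python
import contextlib

import torch
import transformer_engine as te


class MegatronFSDPSketch(torch.nn.Module):
    """Hypothetical stand-in for MegatronFSDP, not the real class."""

    def __init__(self, module: torch.nn.Module, fp8_recipe=None):
        super().__init__()
        self.module = module
        # Enter the quantized-init context here so that Megatron-LM and native
        # Torch callers share one initialization pattern for FP8 parameters.
        ctx = (
            te.pytorch.quantized_model_init(
                enabled=True,
                recipe=fp8_recipe,
                preserve_high_precision_init_val=True,
            )
            if fp8_recipe is not None
            else contextlib.nullcontext()
        )
        with ctx:
            self._build_param_and_grad_buffers()

    def _build_param_and_grad_buffers(self):
        # Placeholder for the sharded parameter / gradient buffer setup.
        pass
```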
/ok to test cba67e3
The API compatibility check error is expected, just like in the dev PR, with the same violations.
Signed-off-by: oliver könig <[email protected]>
What does this PR do?
dev MR: #2086
Major changes:
- Make Megatron-FSDP's all-gather (AG) pipeline support different data-parallel buffers, because MXFP8 quantizes along different directions in the forward and backward passes.
- Decouple the FP8-related logic from the main workflow and provide a unified abstraction to 1) operate on the raw data storage of the different recipes and 2) create or discard the transpose cache for the different recipes (a minimal illustrative sketch follows this list).
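To make the second point concrete, below is a minimal, hypothetical sketch of what such a recipe-level abstraction could look like. The class and method names are not from this PR; they only illustrate the idea of hiding per-recipe raw-storage access and transpose-cache handling behind a single interface that the all-gather pipeline calls:

```python
from typing import Optional, Protocol

import torch


class FP8RecipeHandler(Protocol):
    """Hypothetical per-recipe interface (illustrative, not the Megatron-FSDP API)."""

    def raw_data(self, param: torch.Tensor) -> torch.Tensor:
        """Return the raw byte storage that the all-gather pipeline copies into
        its data-parallel buffer (e.g. the rowwise-quantized data for MXFP8)."""
        ...

    def make_transpose_cache(self, param: torch.Tensor) -> Optional[torch.Tensor]:
        """Create the columnwise/transposed copy for recipes whose backward pass
        consumes the weight in a different quantization direction; return None
        for recipes that do not need one."""
        ...

    def discard_transpose_cache(self, param: torch.Tensor) -> None:
        """Free the transposed copy once it is no longer needed, so the extra
        memory is only held for the duration of the backward pass."""
        ...


def prepare_params_for_forward(params: list[torch.Tensor], handler: FP8RecipeHandler) -> None:
    """Illustrative call site: the main all-gather workflow only sees the
    interface and stays free of recipe-specific branches."""
    for p in params:
        raw = handler.raw_data(p)        # recipe-agnostic access to raw storage
        # ... copy `raw` into the forward data-parallel buffer here ...
        handler.make_transpose_cache(p)  # materialized only if the recipe needs it
```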
Contribution process
```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch:
- (Step 1): Add PR label
- (Step 2): Collect the expert reviewers' reviews. Add the Expert Review label when your PR is ready for review. Final Review might get declined if these requirements are not fulfilled.
- (Step 3): Final Review. Add the Final Review label.
- (Optional Step 4): Cherry-pick into release branch. If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch:
- The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either [email protected] or [email protected].

Merging your PR
- Any member of core-adlr and core-nemo will be able to merge your PR.