feat(dllm): add DFlash and LLaDA2 SFT recipes by kashif · Pull Request #2315 · NVIDIA-NeMo/Automodel

kashif · 2026-05-25T20:36:04Z

What does this PR do?

Adds DFlash speculative decoding and LLaDA2 SFT recipes for diffusion language models.

Changelog

Add DFlashStrategy for DFlash speculative decoding training (draft model trains against frozen target LM hidden states with decay-weighted loss)
Add HybridStrategy + HybridDiffusionLLMLoss (from upstream main, co-existing with DFlash)
Add DFlashDecayLoss with position-decay cross-entropy and DLLMLossOutput
Add Nemotron-3-Nano-30B-A3B DFlash SFT training config (examples/dllm_sft/nemotron_nano30b_dflash.yaml)
Add scripts/create_nemotron_nano30b_dflash_draft.py for initializing 7-layer DFlash draft checkpoints
Update DLLM_STRATEGIES registry: {"mdlm": MDLMStrategy, "hybrid": HybridStrategy, "dflash": DFlashStrategy}
Fix _restore_loaded_dtype to only apply for NeMo model loading paths (not vanilla trust_remote_code models)
Resolve merge conflicts with upstream main (HybridStrategy additions)
Update docs and tests
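
As a rough illustration of the registry entry in the changelog above, a mode string could be resolved to a strategy as sketched here. The three class names and the dictionary keys come from the changelog; the empty class bodies and the `get_strategy` helper are hypothetical stand-ins, not the code in this PR.

```python
class MDLMStrategy:
    """Placeholder for the real MDLM strategy class."""


class HybridStrategy:
    """Placeholder for the real hybrid strategy class."""


class DFlashStrategy:
    """Placeholder for the real DFlash strategy class."""


# Registry as listed in the changelog.
DLLM_STRATEGIES = {"mdlm": MDLMStrategy, "hybrid": HybridStrategy, "dflash": DFlashStrategy}


def get_strategy(mode):
    """Hypothetical helper: instantiate the strategy registered under `mode`."""
    try:
        return DLLM_STRATEGIES[mode]()
    except KeyError:
        known = ", ".join(sorted(DLLM_STRATEGIES))
        raise ValueError(f"Unknown dLLM mode {mode!r}; expected one of: {known}") from None
```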

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?

Additional Information

Replaces #2214 (closed due to accidental force-push that rewrote shared history).

Smoke test on 8×H100 confirmed: step 0 loss 5.3540, DFlash setup verified with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 as target.

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

- Split dflash YAML section: add required dllm: section alongside dflash: (DiffusionLMSFTRecipe.setup() reads dllm:, DFlashSFTRecipe reads dflash:)
- Auto-resolve mask_token_id by adding <|MASK|> special token to tokenizer when neither YAML nor base tokenizer (Qwen3) defines one
- Smoke test: use allenai/tulu-3-sft-mixture[:64] (ChatDataset requires OpenAI-format messages; wikitext plain text didn't qualify)
- Add standalone GPU test script (no torchao dependency)

Smoke run: 2 steps, loss 20.4→28.3, grad_norm flowing, mem 12.65 GiB ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
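
The mask-token fallback described in this commit could look roughly like the sketch below. The helper name `resolve_mask_token_id` and the `StubTokenizer` class are hypothetical and exist only so the example runs without downloading a model. The `add_special_tokens({"mask_token": ...})` call and the `mask_token_id` attribute mirror the Hugging Face tokenizer interface, but whether the PR uses exactly this call is an assumption.

```python
def resolve_mask_token_id(tokenizer, configured_id=None, mask_token="<|MASK|>"):
    """Return (mask_token_id, num_added_tokens).

    Priority: explicit config value, then the tokenizer's own mask token,
    then a newly added special token.
    """
    if configured_id is not None:
        return configured_id, 0
    if getattr(tokenizer, "mask_token_id", None) is not None:
        return tokenizer.mask_token_id, 0
    num_added = tokenizer.add_special_tokens({"mask_token": mask_token})
    return tokenizer.mask_token_id, num_added


class StubTokenizer:
    """Hypothetical stand-in for a tokenizer that has no mask token."""

    def __init__(self):
        self.vocab = {"hello": 0, "world": 1}
        self.mask_token_id = None

    def add_special_tokens(self, special_tokens):
        added = 0
        for token in special_tokens.values():
            if token not in self.vocab:
                self.vocab[token] = len(self.vocab)
                added += 1
        self.mask_token_id = self.vocab[special_tokens["mask_token"]]
        return added


mask_id, num_added = resolve_mask_token_id(StubTokenizer())
```

If `num_added` is non-zero, the model's embedding table may need to grow to cover the new id; how the PR handles that is not stated here.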

- main(): remove type hints from signature, rename recipe → trainer
- _run_train_optim_step(): remove return type and batches type hints
- _run_validation_epoch(): remove return type hint
- log_train_metrics(): remove return type hint; switch % → .format(); add tps_per_gpu and mode to log line to match parent format
- MetricsSample: add Train/mfu (None), Train/supervised_tokens, change hardcoded "dflash" → self.dllm_mode; allreduce num_predicted_tokens

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Move all DFlash-specific training logic into DFlashStrategy so that DiffusionLMSFTRecipe handles all dLLM model families without subclassing.

Strategy additions (strategy.py):
- setup_extra(recipe): hook for loading auxiliary models; DFlashStrategy loads+freezes target LM and resolves mask_token_id for tokenizers that have none (e.g. Qwen3 → adds <|MASK|> special token)
- pre_step(recipe, batches) → (noise_tokens, supervised_tokens): MDLM does corruption loop; DFlash does target forwards + offloads to CPU
- forward_backward(recipe, idx, batch, ...): MDLM delegates to existing _forward_backward_step; DFlash implements anchor-block draft forward
- loss_log_key property: "Loss/Train_DLLM" / "Loss/Train_DFlash"
- _build_target_layer_ids, _sample_anchor_block, _run_target_forward moved from DFlashSFTRecipe into DFlashStrategy

Recipe changes (train_ft.py):
- setup(): defer mask_token_id raise to after setup_extra() call
- _run_train_optim_step(): replace inline corruption loop with strategy.pre_step(); dispatch via strategy.forward_backward()
- _run_validation_epoch(): same pre_step + forward_backward dispatch
- log_train_metrics(): use strategy.loss_log_key instead of hardcoded key

train_dflash.py reduced to a 6-line entry-point shim pointing at DiffusionLMSFTRecipe; DFlashSFTRecipe subclass removed entirely.

Smoke tested: DFlash 2-step run passes (loss 20.4→28.3, grads flowing).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
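
The hook layout named in this commit can be sketched as a small abstract base class plus a recipe-side dispatch function. The method names (`setup_extra`, `pre_step`, `forward_backward`, `loss_log_key`) come from the commit message; `ToyStrategy`, the dict-based batches, and the body of `run_train_optim_step` are hypothetical and only show the call order, not the real training loop.

```python
from abc import ABC, abstractmethod


class DLLMStrategy(ABC):
    def setup_extra(self, recipe):
        """Optional hook for loading auxiliary models; no-op by default."""

    @abstractmethod
    def pre_step(self, recipe, batches):
        """Return (noise_tokens, supervised_tokens) for the whole step."""

    @abstractmethod
    def forward_backward(self, recipe, idx, batch, **kwargs):
        """Run one micro-batch and return its (already normalised) loss."""

    @property
    @abstractmethod
    def loss_log_key(self):
        """Metric key under which the training loss is logged."""


class ToyStrategy(DLLMStrategy):
    """Hypothetical strategy that treats the count of supervised tokens as the loss sum."""

    def pre_step(self, recipe, batches):
        supervised = sum(sum(b["loss_mask"]) for b in batches)
        return supervised, supervised

    def forward_backward(self, recipe, idx, batch, **kwargs):
        return sum(batch["loss_mask"]) / max(kwargs["num_diffusion_tokens"], 1)

    @property
    def loss_log_key(self):
        return "Loss/Train_Toy"


def run_train_optim_step(strategy, recipe, batches):
    """Recipe-side dispatch: one pre_step, then one forward_backward per micro-batch."""
    noise_tokens, _supervised = strategy.pre_step(recipe, batches)
    losses = [
        strategy.forward_backward(recipe, i, b, num_diffusion_tokens=noise_tokens)
        for i, b in enumerate(batches)
    ]
    return {strategy.loss_log_key: sum(losses)}
```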

- Remove nemo_automodel/recipes/dllm/train_dflash.py; use the existing examples/dllm_sft/finetune.py as the single entry point for all dLLM SFT modes including DFlash (pthombre)
- Update dflash_sft.yaml: switch recipe key to DiffusionLMSFTRecipe, update usage comment to point at finetune.py
- Pass num_tokens=num_diffusion_tokens to DFlashDecayLoss in DFlashStrategy.forward_backward so the loss is properly normalised across DP replicas and gradient-accumulation steps (claude review)
- Fix test_dflash_sft_gpu.py: import and use the real DFlashDecayLoss instead of an inlined variant with different normalization; pass num_tokens so the test exercises the same code path as production; fix docstring ("three" -> "four" novel pieces) (claude review)

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
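
Why passing a shared `num_tokens` matters can be seen with a small arithmetic sketch. The function names and numbers below are made up for illustration; the point is only that dividing each micro-batch's summed NLL by the global token count and adding the results reproduces the global per-token mean, while averaging per-micro-batch means does not when micro-batches have different token counts.

```python
def loss_with_global_normalization(micro_batch_nlls, num_tokens):
    """Sum of (micro-batch NLL sum / global token count)."""
    return sum(sum(nlls) / num_tokens for nlls in micro_batch_nlls)


def mean_of_micro_batch_means(micro_batch_nlls):
    """Naive alternative: average the per-micro-batch mean losses."""
    means = [sum(nlls) / len(nlls) for nlls in micro_batch_nlls]
    return sum(means) / len(means)


# Two micro-batches (or two DP ranks) with 2 and 6 supervised tokens.
micro_batch_nlls = [[2.0, 4.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
num_tokens = sum(len(nlls) for nlls in micro_batch_nlls)

global_mean = loss_with_global_normalization(micro_batch_nlls, num_tokens)  # 12 / 8
naive_mean = mean_of_micro_batch_means(micro_batch_nlls)                    # (3 + 1) / 2
```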

- Merge DFlashDecayLoss into dllm_loss.py; delete dflash_loss.py
- DFlashStrategy: add _sample_anchor_blocks (stars-and-bars sampling of N non-overlapping anchors) and _build_block_attention_mask (sparse 4D mask where block b attends only to its own causal prefix and own noise positions)
- DFlashDecayLoss: add block_size param so per-block decay resets at each block boundary for N>1 training
- pre_step dispatches to multi-block path when num_blocks_per_sample > 1; position_ids use actual sequence positions for correct RoPE
- dflash_sft.yaml: num_blocks_per_sample: 1 (explicit default)
- dflash_smoke.yaml: num_blocks_per_sample: 4; fix recipe name

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
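
The three pieces in this commit can be sketched in plain Python. Everything below is an illustration under assumptions, not the PR's implementation: the function names are hypothetical, the mask is built as a 2D boolean grid for one sample and one head (the commit describes a sparse 4D mask), whether the anchor token itself counts as part of the causal prefix is a guess, and the geometric `decay ** offset` weighting is an assumed form of the position decay.

```python
import random


def sample_anchor_blocks(seq_len, num_blocks, block_size, rng):
    """Stars-and-bars: spread the leftover slack over the gaps around N blocks."""
    slack = seq_len - num_blocks * block_size
    if slack < 0:
        raise ValueError("sequence too short for the requested blocks")
    # Choose N distinct "bar" slots among slack + N; c - i is the slack used so far.
    cuts = sorted(rng.sample(range(slack + num_blocks), num_blocks))
    return [(c - i) + i * block_size for i, c in enumerate(cuts)]


def build_block_attention_mask(starts, block_size, seq_len):
    """Rows: noise positions. Columns: seq_len context slots, then noise slots."""
    num_noise = len(starts) * block_size
    mask = []
    for q in range(num_noise):
        b = q // block_size
        context = [k < starts[b] for k in range(seq_len)]        # own causal prefix
        noise = [k // block_size == b for k in range(num_noise)]  # own block only
        mask.append(context + noise)
    return mask


def block_decay_weights(num_blocks, block_size, decay=0.9):
    """Decay restarts at 1.0 at every block boundary."""
    return [decay ** (j % block_size) for j in range(num_blocks * block_size)]


rng = random.Random(0)
starts = sample_anchor_blocks(seq_len=32, num_blocks=4, block_size=4, rng=rng)
mask = build_block_attention_mask(starts, block_size=4, seq_len=32)
weights = block_decay_weights(num_blocks=4, block_size=4)
```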


…model/mask token

- AND block_mask with loss_mask in _sample_anchor_block and _sample_anchor_blocks so prompt tokens are not supervised during DFlash SFT
- pre_step now reads loss_mask from the batch and passes it through
- Fix llada2_sft.yaml: correct model to inclusionAI/LLaDA2.1-mini and update mask_token_id from 126336 to 156895 (<|mask|> in Qwen tokenizer)
- Update docs: note different mask_token_id values for LLaDA vs LLaDA2.1
- Add test_loss_mask_zeros_block_mask to cover the new loss_mask path

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
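
The masking fix in this commit amounts to an element-wise AND. A minimal sketch, using a hypothetical helper name and plain lists in place of tensors:

```python
def supervised_block_mask(block_mask, loss_mask):
    """Keep a block position only if it is also a supervised (non-prompt) token."""
    if len(block_mask) != len(loss_mask):
        raise ValueError("block_mask and loss_mask must have the same length")
    return [bool(b) and bool(m) for b, m in zip(block_mask, loss_mask)]


# Positions 0-2 are prompt tokens (loss_mask 0); the sampled block covers 2-4.
loss_mask = [0, 0, 0, 1, 1, 1]
block_mask = [0, 0, 1, 1, 1, 0]
combined = supervised_block_mask(block_mask, loss_mask)
```

Position 2 falls inside the block but belongs to the prompt, so it drops out of the supervised set.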


…draft init script

- Use `dtype=` instead of deprecated `torch_dtype=` in AutoModelForCausalLM.from_pretrained
- Add `trust_remote_code=True` to support Nemotron-H hybrid SSM target models
- Add scripts/create_nemotron_nano30b_dflash_draft.py to initialise a 7-layer DFlash draft (hidden=2688, 21 Q-heads, 3 KV-heads) for NVIDIA-Nemotron-3-Nano-30B-A3B

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

7-layer draft (hidden=2688, 21 Q-heads, 3 KV-heads, head_dim=128) conditioned on frozen NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 target, trained on Nemotron-Post-Training-Dataset-v2 chat+math+code mix for 3 epochs.

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
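
The draft dimensions quoted above are internally consistent, which a few lines of arithmetic confirm. The variable names are illustrative; only the four numbers come from the commit message.

```python
hidden_size = 2688
num_q_heads = 21
num_kv_heads = 3
head_dim = 128

q_width = num_q_heads * head_dim        # 21 * 128 = 2688, equal to hidden_size
kv_width = num_kv_heads * head_dim      # 3 * 128 = 384
q_per_kv = num_q_heads // num_kv_heads  # 7 query heads share each KV head

assert q_width == hidden_size
assert num_q_heads % num_kv_heads == 0
```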

…gside DFlash

- Add HybridDiffusionLLMLoss and HybridStrategy (Nemotron-Labs-Diffusion) from main
- Import corrupt_blockwise alongside corrupt_uniform in strategy.py
- Register "hybrid" in DLLM_STRATEGIES alongside "mdlm" and "dflash"
- Update docs and tests to cover all three strategies

Signed-off-by: Kashif Rasul <kashif@huggingface.co>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

train_ft.py now passes num_ar_tokens= to forward_backward for all strategies (added in the Nemotron-Labs-Diffusion hybrid support). DFlashStrategy and the DLLMStrategy base class both need it in their signatures to avoid a TypeError. DFlash itself does not use AR tokens but accepts the kwarg for API compatibility.

Signed-off-by: Kashif Rasul <kashif@huggingface.co>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
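
The failure mode and the fix can be reproduced in a few lines. The class names ending in `Like` and `OldSignature`, the `dispatch` helper, and the toy loss are hypothetical; only the `num_ar_tokens` keyword and the `forward_backward` method name come from the commit message.

```python
class DLLMStrategy:
    def forward_backward(self, recipe, idx, batch, *, num_diffusion_tokens=None, num_ar_tokens=None):
        raise NotImplementedError


class DFlashLikeStrategy(DLLMStrategy):
    def forward_backward(self, recipe, idx, batch, *, num_diffusion_tokens=None, num_ar_tokens=None):
        del num_ar_tokens  # accepted for API compatibility, intentionally unused
        return sum(batch) / max(num_diffusion_tokens or 1, 1)


class OldSignatureStrategy(DLLMStrategy):
    # Signature from before the kwarg was introduced: the shared call site breaks.
    def forward_backward(self, recipe, idx, batch, *, num_diffusion_tokens=None):
        return 0.0


def dispatch(strategy, batch):
    """Stand-in for the recipe call site, which passes both kwargs to every strategy."""
    return strategy.forward_backward(None, 0, batch, num_diffusion_tokens=4, num_ar_tokens=0)


loss = dispatch(DFlashLikeStrategy(), [1.0, 3.0])
```

Calling `dispatch(OldSignatureStrategy(), ...)` raises `TypeError` because of the unexpected `num_ar_tokens` keyword.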

The Nemotron-Labs-Diffusion hybrid commit sets _restore_loaded_dtype=False on the model config to prevent NeMo's loader from silently downcasting master weights. However, when the model target is a vanilla transformers class (e.g. DFlash using AutoModel.from_pretrained with trust_remote_code), this internal flag gets passed to the model __init__ and raises TypeError. Guard the flag to NeMo model targets only.

Signed-off-by: Kashif Rasul <kashif@huggingface.co>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
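
One way such a guard could be written is sketched below. The `model_kwargs_for` helper and the module-prefix check are assumptions made for illustration; the commit message does not say how the PR distinguishes NeMo targets from vanilla ones. The two dummy classes stand in for real model classes.

```python
def model_kwargs_for(target, base_kwargs):
    """Add the internal flag only when the target looks like a NeMo class (assumed check)."""
    kwargs = dict(base_kwargs)
    module = getattr(target, "__module__", "") or ""
    if module.startswith("nemo_automodel"):
        kwargs["_restore_loaded_dtype"] = False
    return kwargs


class VanillaModel:
    """Hypothetical vanilla class: rejects unknown keyword arguments."""

    def __init__(self, name):
        self.name = name


class NeMoModel:
    """Hypothetical NeMo-side class that understands the internal flag."""

    def __init__(self, name, _restore_loaded_dtype=True):
        self.name = name
        self.restore = _restore_loaded_dtype


# Pretend these classes live in the respective packages.
VanillaModel.__module__ = "transformers.models.auto"
NeMoModel.__module__ = "nemo_automodel.components.models"

vanilla = VanillaModel(**model_kwargs_for(VanillaModel, {"name": "draft"}))
nemo = NeMoModel(**model_kwargs_for(NeMoModel, {"name": "target"}))
```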

pthombre · 2026-05-27T05:50:09Z

/ok to test 0f14f71


Draw anchor blocks independently per sample ([B, N]) instead of sharing one set across the batch, build a genuine per-sample block_keep_mask, and drop samples with fewer than 2*block_size supervised tokens.

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
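
A minimal sketch of per-sample anchor draws with the short-sample filter follows. The helper names are hypothetical, the per-sample keep flag is a simplification of the block_keep_mask named in the commit, and the stars-and-bars draw is an assumed sampling scheme; dropped samples are represented by `None` rows here rather than by tensor masking.

```python
import random


def sample_anchor_starts(seq_len, num_blocks, block_size, rng):
    """Draw N non-overlapping block starts for one sample."""
    slack = seq_len - num_blocks * block_size
    if slack < 0:
        raise ValueError("sequence too short for the requested blocks")
    cuts = sorted(rng.sample(range(slack + num_blocks), num_blocks))
    return [(c - i) + i * block_size for i, c in enumerate(cuts)]


def per_sample_anchors(loss_masks, num_blocks, block_size, rng):
    """Return ([B] list of anchor rows or None, [B] keep flags)."""
    keep = [sum(mask) >= 2 * block_size for mask in loss_masks]
    anchors = [
        sample_anchor_starts(len(mask), num_blocks, block_size, rng) if ok else None
        for mask, ok in zip(loss_masks, keep)
    ]
    return anchors, keep


rng = random.Random(0)
loss_masks = [[1] * 16, [0] * 14 + [1] * 2, [1] * 16]
anchors, keep = per_sample_anchors(loss_masks, num_blocks=3, block_size=2, rng=rng)
```

The middle sample has only 2 supervised tokens, fewer than `2 * block_size = 4`, so it is dropped.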

Add a DP/CP-correct draft top-1 accuracy metric (per-rank correct/global tokens, SUM-allreduced) computed for free in the chunked linear-CE loop, and prune isinstance/shape-only/constant tests from the dLLM suite.

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
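
The metric described here can be sketched without a tensor library. In the toy below, logits are produced one chunk of rows at a time, the NLL sum and top-1 hit count are accumulated in the same loop, and each simulated rank divides its hit count by the global token count before a plain `sum` stands in for a SUM all-reduce. All names, shapes, and data are hypothetical.

```python
import math


def chunked_ce_and_correct(hidden, weight, targets, chunk_size):
    """Return (NLL sum, top-1 correct count), materialising logits one chunk at a time."""
    nll_sum, correct = 0.0, 0
    for start in range(0, len(hidden), chunk_size):
        rows = hidden[start:start + chunk_size]
        chunk_logits = [[sum(h * w for h, w in zip(row, wrow)) for wrow in weight] for row in rows]
        for logits, target in zip(chunk_logits, targets[start:start + chunk_size]):
            peak = max(logits)
            log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
            nll_sum += log_sum_exp - logits[target]
            correct += int(logits.index(peak) == target)  # accuracy reuses the same logits
    return nll_sum, correct


weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # vocab 3, hidden 3
one_hot = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}

# (hidden states, targets) per simulated rank, with unequal token counts.
rank_data = [
    ([one_hot[0], one_hot[1]], [0, 2]),
    ([one_hot[2], one_hot[2], one_hot[0], one_hot[1], one_hot[1], one_hot[2]], [2, 2, 0, 1, 1, 0]),
]
global_tokens = sum(len(targets) for _, targets in rank_data)

per_rank = [chunked_ce_and_correct(h, weight, t, chunk_size=4)[1] / global_tokens for h, t in rank_data]
accuracy = sum(per_rank)  # stands in for a SUM all-reduce across ranks
```

Averaging each rank's local accuracy instead would over-weight the rank with fewer tokens.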


kashif added 27 commits May 12, 2026 07:24

feat(dllm): add DFlash and LLaDA2 SFT recipes

19e712d

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

remove smoke yaml changes from PR

fef8f17

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

remove example yamls from PR (internal development only)

6151b1f

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

re-add dflash_sft.yaml and llada2_sft.yaml

3d7d2d0

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

remove test_dflash_sft_gpu.py from PR (internal smoke test)

85f726b

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

add DFlash section to dLLM finetune docs; re-add GPU smoke test

89e0a01

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

add DFlash unit tests; remove GPU smoke script from PR

83be478

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

clean up DFlash unit tests: style, comments, redundant checks

9c7bf35

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

ruff format

fc12598

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

fix DFlashDecayLoss: del logits after NLL, use token_nll.device

b70b84c

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Update docs/guides/dllm/finetune.md

9d13d88

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Update docs/guides/dllm/finetune.md

44dac53

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Update docs/guides/dllm/finetune.md

680a6ff

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Update docs/guides/dllm/finetune.md

63b612b

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Merge branch 'main' into add-dllm-pipelines

8d5045f

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Merge branch 'main' into add-dllm-pipelines

8c330cb

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

kashif requested review from HuiyingLi, adil-a and akoumpa as code owners May 25, 2026 20:36

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 26, 2026


fix(lint): address dflash lint failures

a816d2f

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>


kashif added 12 commits May 27, 2026 15:34

Merge branch 'main' into add-dflash-pipeline

85ad9c4

fix(dflash): chunked linear-CE keeps N=512 within memory under FSDP2

44704d5

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Merge branch 'add-dflash-pipeline' of https://github.com/kashif/Autom…

7f110da

…odel into add-dllm-pipelines

feat(dflash): expose ce_chunk_size config for chunked linear-CE

61283fc

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

perf(dflash): fix flex-attention shape to compile once, not per step

af439dc

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

perf(dflash): pad context to fixed length so flex-attention compiles …

9a44ec6

…once Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

chore(dflash): train 6 epochs (paper Appendix A.1)

e9d971d

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

refactor(dllm): bare metric keys; per-position draft accuracy

58864fb

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Merge branch 'main' into add-dflash-pipeline

e10982e

chore(dllm): drop flex_attention defensive checks; doc updates

eb2e352

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>


kashif wants to merge 50 commits into NVIDIA-NeMo:main from kashif:add-dflash-pipeline


4 participants
