feat(dllm): add DFlash and LLaDA2 SFT recipes #2315

Open
kashif wants to merge 50 commits into NVIDIA-NeMo:main from kashif:add-dflash-pipeline

@kashif kashif commented May 25, 2026

What does this PR do?

Adds DFlash speculative decoding and LLaDA2 SFT recipes for diffusion language models.

Changelog

  • Add DFlashStrategy for DFlash speculative decoding training (draft model trains against frozen target LM hidden states with decay-weighted loss)
  • Add HybridStrategy + HybridDiffusionLLMLoss (from upstream main, co-existing with DFlash)
  • Add DFlashDecayLoss with position-decay cross-entropy and DLLMLossOutput
  • Add Nemotron-3-Nano-30B-A3B DFlash SFT training config (examples/dllm_sft/nemotron_nano30b_dflash.yaml)
  • Add scripts/create_nemotron_nano30b_dflash_draft.py for initializing 7-layer DFlash draft checkpoints
  • Update DLLM_STRATEGIES registry: {"mdlm": MDLMStrategy, "hybrid": HybridStrategy, "dflash": DFlashStrategy}
  • Fix _restore_loaded_dtype so it is applied only on NeMo model loading paths (not to vanilla trust_remote_code models)
  • Resolve merge conflicts with upstream main (HybridStrategy additions)
  • Update docs and tests
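
The changelog describes DFlashDecayLoss as a position-decay cross-entropy, but the weighting itself is not spelled out on this page. The following is only a minimal sketch of the idea, assuming an exponential weight `gamma ** k` for position `k` inside a block and a weight-normalised mean; `gamma`, `decay_weights`, and `decay_weighted_ce` are hypothetical names, not the PR's API.

```python
def decay_weights(block_size, gamma=0.9):
    """Per-position weights for one block: gamma ** k (assumed form)."""
    return [gamma ** k for k in range(block_size)]


def decay_weighted_ce(token_nll, gamma=0.9):
    """Weighted mean of per-position negative log-likelihoods.

    token_nll[k] is the cross-entropy of the k-th predicted token in a block.
    Earlier positions get larger weights, so early mistakes cost more.
    """
    weights = decay_weights(len(token_nll), gamma)
    total = sum(w * nll for w, nll in zip(weights, token_nll))
    return total / sum(weights)


# Toy block: the first two positions are predicted better than the last two.
nll = [2.0, 2.0, 4.0, 4.0]
loss = decay_weighted_ce(nll, gamma=0.5)
plain_mean = sum(nll) / len(nll)
```

With decay, the well-predicted early positions dominate, so `loss` ends up below `plain_mean`.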

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

Additional Information

Replaces #2214 (closed due to accidental force-push that rewrote shared history).

Smoke test on 8×H100 confirmed: step 0 loss 5.3540, DFlash setup verified with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 as target.

kashif added 27 commits May 12, 2026 07:24
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
- Split dflash YAML section: add required dllm: section alongside dflash:
  (DiffusionLMSFTRecipe.setup() reads dllm:, DFlashSFTRecipe reads dflash:)
- Auto-resolve mask_token_id by adding <|MASK|> special token to tokenizer
  when neither YAML nor base tokenizer (Qwen3) defines one
- Smoke test: use allenai/tulu-3-sft-mixture[:64] (ChatDataset requires
  OpenAI-format messages; wikitext plain text didn't qualify)
- Add standalone GPU test script (no torchao dependency)

Smoke run: 2 steps, loss 20.4→28.3, grad_norm flowing, mem 12.65 GiB ✓
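
The mask-token fallback described in this commit can be sketched as a three-level lookup. This is an illustration under assumptions, not the PR's code: `ToyTokenizer` is a stand-in for a real tokenizer, and `resolve_mask_token_id` is a hypothetical helper name.

```python
MASK_TOKEN = "<|MASK|>"


class ToyTokenizer:
    """Stand-in for a real tokenizer, with just enough surface for the sketch."""

    def __init__(self, vocab, mask_token_id=None):
        self.vocab = dict(vocab)
        self.mask_token_id = mask_token_id

    def add_special_token(self, token):
        # Append the token at the end of the vocabulary if it is new.
        if token not in self.vocab:
            self.vocab[token] = len(self.vocab)
        return self.vocab[token]


def resolve_mask_token_id(yaml_value, tokenizer):
    """Priority: YAML value, then the tokenizer's own mask token, then a new token."""
    if yaml_value is not None:
        return yaml_value
    if tokenizer.mask_token_id is not None:
        return tokenizer.mask_token_id
    # Neither source defines one: register <|MASK|> as a new special token.
    # A real model may also need its embedding table resized for the new id.
    return tokenizer.add_special_token(MASK_TOKEN)


tok = ToyTokenizer({"hello": 0, "world": 1})
new_id = resolve_mask_token_id(None, tok)
```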

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
- main(): remove type hints from signature, rename recipe → trainer
- _run_train_optim_step(): remove return type and batches type hints
- _run_validation_epoch(): remove return type hint
- log_train_metrics(): remove return type hint; switch % → .format();
  add tps_per_gpu and mode to log line to match parent format
- MetricsSample: add Train/mfu (None), Train/supervised_tokens,
  change hardcoded "dflash" → self.dllm_mode; allreduce num_predicted_tokens

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Move all DFlash-specific training logic into DFlashStrategy so that
DiffusionLMSFTRecipe handles all dLLM model families without subclassing.

Strategy additions (strategy.py):
- setup_extra(recipe): hook for loading auxiliary models; DFlashStrategy
  loads+freezes target LM and resolves mask_token_id for tokenizers that
  have none (e.g. Qwen3 → adds <|MASK|> special token)
- pre_step(recipe, batches) → (noise_tokens, supervised_tokens):
  MDLM does corruption loop; DFlash does target forwards + offloads to CPU
- forward_backward(recipe, idx, batch, ...): MDLM delegates to existing
  _forward_backward_step; DFlash implements anchor-block draft forward
- loss_log_key property: "Loss/Train_DLLM" / "Loss/Train_DFlash"
- _build_target_layer_ids, _sample_anchor_block, _run_target_forward
  moved from DFlashSFTRecipe into DFlashStrategy

Recipe changes (train_ft.py):
- setup(): defer mask_token_id raise to after setup_extra() call
- _run_train_optim_step(): replace inline corruption loop with
  strategy.pre_step(); dispatch via strategy.forward_backward()
- _run_validation_epoch(): same pre_step + forward_backward dispatch
- log_train_metrics(): use strategy.loss_log_key instead of hardcoded key

train_dflash.py reduced to a 6-line entry-point shim pointing at
DiffusionLMSFTRecipe; DFlashSFTRecipe subclass removed entirely.
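
As a rough picture of the refactor, the hooks named above (`setup_extra`, `pre_step`, `forward_backward`, `loss_log_key`) can be arranged as a small strategy interface that the recipe dispatches through. The method names come from this commit message; the bodies, the dict-based `recipe`, and `ToyDFlashStrategy` are hypothetical placeholders rather than the real implementation.

```python
from abc import ABC, abstractmethod


class DLLMStrategy(ABC):
    loss_log_key = "Loss/Train_DLLM"

    def setup_extra(self, recipe):
        """Hook for loading auxiliary models; the default does nothing."""

    @abstractmethod
    def pre_step(self, recipe, batches):
        """Return (noise_tokens, supervised_tokens) for the optimizer step."""

    @abstractmethod
    def forward_backward(self, recipe, idx, batch, **kwargs):
        """Run one microbatch and return its (unnormalised) loss."""


class ToyDFlashStrategy(DLLMStrategy):
    loss_log_key = "Loss/Train_DFlash"

    def setup_extra(self, recipe):
        # Stands in for loading and freezing the target LM.
        recipe["target_loaded"] = True

    def pre_step(self, recipe, batches):
        supervised = sum(sum(b["loss_mask"]) for b in batches)
        return supervised, supervised

    def forward_backward(self, recipe, idx, batch, **kwargs):
        # Placeholder "loss": one unit per supervised token.
        return float(sum(batch["loss_mask"]))


def run_train_optim_step(recipe, strategy, batches):
    """The recipe only talks to the strategy, never to a model family."""
    noise_tokens, supervised_tokens = strategy.pre_step(recipe, batches)
    losses = [
        strategy.forward_backward(recipe, i, b, num_tokens=noise_tokens)
        for i, b in enumerate(batches)
    ]
    return {strategy.loss_log_key: sum(losses) / max(supervised_tokens, 1)}


recipe = {}
strategy = ToyDFlashStrategy()
strategy.setup_extra(recipe)
metrics = run_train_optim_step(
    recipe, strategy, [{"loss_mask": [1, 1, 0]}, {"loss_mask": [0, 1, 0]}]
)
```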

Smoke tested: DFlash 2-step run passes (loss 20.4→28.3, grads flowing).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
- Remove nemo_automodel/recipes/dllm/train_dflash.py; use the existing
  examples/dllm_sft/finetune.py as the single entry point for all dLLM
  SFT modes including DFlash (pthombre)
- Update dflash_sft.yaml: switch recipe key to DiffusionLMSFTRecipe,
  update usage comment to point at finetune.py
- Pass num_tokens=num_diffusion_tokens to DFlashDecayLoss in
  DFlashStrategy.forward_backward so the loss is properly normalised
  across DP replicas and gradient-accumulation steps (claude review)
- Fix test_dflash_sft_gpu.py: import and use the real DFlashDecayLoss
  instead of an inlined variant with different normalization; pass
  num_tokens so the test exercises the same code path as production;
  fix docstring ("three" -> "four" novel pieces) (claude review)
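
Why passing a global `num_tokens` matters can be seen with plain arithmetic. The sketch below is not the PR's loss code; it only contrasts averaging per-microbatch means with dividing every microbatch's loss sum by the total token count, using made-up per-token losses.

```python
def mean_of_means(microbatches):
    """Average each microbatch separately, then average the averages."""
    return sum(sum(mb) / len(mb) for mb in microbatches) / len(microbatches)


def global_token_mean(microbatches):
    """Divide each microbatch's loss sum by the global token count, then add."""
    num_tokens = sum(len(mb) for mb in microbatches)
    return sum(sum(mb) / num_tokens for mb in microbatches)


# Two microbatches (or two DP replicas) with unequal token counts.
microbatches = [[4.0], [1.0, 1.0, 1.0]]

biased = mean_of_means(microbatches)
correct = global_token_mean(microbatches)
true_mean = sum(sum(mb) for mb in microbatches) / 4
```

The single-token microbatch is over-weighted in `biased`, while `correct` matches the true per-token mean regardless of how tokens are split.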

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
- Merge DFlashDecayLoss into dllm_loss.py; delete dflash_loss.py
- DFlashStrategy: add _sample_anchor_blocks (stars-and-bars sampling of N
  non-overlapping anchors) and _build_block_attention_mask (sparse 4D mask
  where block b attends only to its own causal prefix and own noise positions)
- DFlashDecayLoss: add block_size param so per-block decay resets at each
  block boundary for N>1 training
- pre_step dispatches to multi-block path when num_blocks_per_sample > 1;
  position_ids use actual sequence positions for correct RoPE
- dflash_sft.yaml: num_blocks_per_sample: 1 (explicit default)
- dflash_smoke.yaml: num_blocks_per_sample: 4; fix recipe name
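
Stars-and-bars sampling of non-overlapping blocks can be sketched as follows. This is a generic version of the technique, assuming each anchor is the start index of a fixed-length block; `sample_anchor_blocks` here is a standalone illustration and may differ from `_sample_anchor_blocks` in the PR.

```python
import random


def sample_anchor_blocks(seq_len, num_blocks, block_size, rng=random):
    """Sample start indices of num_blocks non-overlapping blocks.

    The free positions (slack) are split uniformly at random into the gaps
    before, between, and after the blocks, so every valid placement is
    equally likely.
    """
    slack = seq_len - num_blocks * block_size
    if slack < 0:
        raise ValueError("sequence too short for the requested blocks")
    # Choose num_blocks "bars" among slack + num_blocks slots.
    picks = sorted(rng.sample(range(slack + num_blocks), num_blocks))
    # picks[i] - i is the slack consumed before block i; add the room taken
    # by the i earlier blocks to get the actual start index.
    return [c - i + i * block_size for i, c in enumerate(picks)]


rng = random.Random(0)
starts = sample_anchor_blocks(seq_len=32, num_blocks=4, block_size=4, rng=rng)
```

Consecutive starts are always at least `block_size` apart, and the last block ends at or before `seq_len`.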

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
…model/mask token

- AND block_mask with loss_mask in _sample_anchor_block and _sample_anchor_blocks
  so prompt tokens are not supervised during DFlash SFT
- pre_step now reads loss_mask from the batch and passes it through
- Fix llada2_sft.yaml: correct model to inclusionAI/LLaDA2.1-mini and
  update mask_token_id from 126336 to 156895 (<|mask|> in Qwen tokenizer)
- Update docs: note different mask_token_id values for LLaDA vs LLaDA2.1
- Add test_loss_mask_zeros_block_mask to cover the new loss_mask path
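
The effect of AND-ing the two masks is easy to show on a toy sequence. This is a minimal sketch with plain lists; `supervised_block_mask` is a hypothetical helper, and the real code presumably operates on tensors.

```python
def supervised_block_mask(block_mask, loss_mask):
    """Keep a position only if it is in a sampled block AND is a response token."""
    return [bool(b) and bool(l) for b, l in zip(block_mask, loss_mask)]


# Positions 0-2 are prompt tokens (loss_mask = 0); 3-5 are response tokens.
loss_mask = [0, 0, 0, 1, 1, 1]
# The sampled block happens to straddle the prompt/response boundary.
block_mask = [0, 0, 1, 1, 1, 0]

combined = supervised_block_mask(block_mask, loss_mask)
```

Position 2 is inside the block but belongs to the prompt, so it is dropped from supervision.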

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
…draft init script

- Use `dtype=` instead of deprecated `torch_dtype=` in AutoModelForCausalLM.from_pretrained
- Add `trust_remote_code=True` to support Nemotron-H hybrid SSM target models
- Add scripts/create_nemotron_nano30b_dflash_draft.py to initialise a 7-layer
  DFlash draft (hidden=2688, 21 Q-heads, 3 KV-heads) for NVIDIA-Nemotron-3-Nano-30B-A3B

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
7-layer draft (hidden=2688, 21 Q-heads, 3 KV-heads, head_dim=128) conditioned
on frozen NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 target, trained on
Nemotron-Post-Training-Dataset-v2 chat+math+code mix for 3 epochs.

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
…gside DFlash

- Add HybridDiffusionLLMLoss and HybridStrategy (Nemotron-Labs-Diffusion) from main
- Import corrupt_blockwise alongside corrupt_uniform in strategy.py
- Register "hybrid" in DLLM_STRATEGIES alongside "mdlm" and "dflash"
- Update docs and tests to cover all three strategies

Signed-off-by: Kashif Rasul <kashif@huggingface.co>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
train_ft.py now passes num_ar_tokens= to forward_backward for all strategies
(added in the Nemotron-Labs-Diffusion hybrid support). DFlashStrategy and the
DLLMStrategy base class both need it in their signatures to avoid a TypeError.
DFlash itself does not use AR tokens but accepts the kwarg for API compatibility.
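
The failure mode and the fix can be reproduced in a few lines. The classes below are hypothetical stand-ins (the real signatures take more arguments); they only show that a caller passing `num_ar_tokens=` breaks any strategy whose signature lacks it.

```python
class OldSignatureStrategy:
    def forward_backward(self, batch, num_tokens):
        return sum(batch) / num_tokens


class CompatibleStrategy:
    def forward_backward(self, batch, num_tokens, num_ar_tokens=None):
        # num_ar_tokens is accepted for API compatibility and ignored.
        return sum(batch) / num_tokens


def call_like_recipe(strategy, batch):
    """Mimics a caller that always passes num_ar_tokens=."""
    return strategy.forward_backward(batch, num_tokens=len(batch), num_ar_tokens=0)


ok = call_like_recipe(CompatibleStrategy(), [1.0, 3.0])

try:
    call_like_recipe(OldSignatureStrategy(), [1.0, 3.0])
    old_failed = False
except TypeError:
    old_failed = True
```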

Signed-off-by: Kashif Rasul <kashif@huggingface.co>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
The Nemotron-Labs-Diffusion hybrid commit sets _restore_loaded_dtype=False
on the model config to prevent NeMo's loader from silently downcasting
master weights. However, when the model target is a vanilla transformers
class (e.g. DFlash using AutoModel.from_pretrained with trust_remote_code),
this internal flag gets passed to the model __init__ and raises TypeError.

Guard the flag to NeMo model targets only.
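
One way to picture the guard, purely as an assumption about its shape: inject the internal flag only when the configured target lives in the NeMo package. `is_nemo_target`, `build_model_kwargs`, the module-prefix check, and the example target paths are all hypothetical and may not match how the PR detects NeMo targets.

```python
NEMO_PREFIX = "nemo_automodel."


def is_nemo_target(target_path):
    """Hypothetical check: treat targets under nemo_automodel as NeMo loaders."""
    return target_path.startswith(NEMO_PREFIX)


def build_model_kwargs(target_path, base_kwargs):
    """Add the internal dtype flag only for loaders that understand it."""
    kwargs = dict(base_kwargs)
    if is_nemo_target(target_path):
        kwargs["_restore_loaded_dtype"] = False
    return kwargs


# Illustrative target paths, not taken from the PR's configs.
nemo_kwargs = build_model_kwargs("nemo_automodel.SomeModel.from_pretrained", {})
hf_kwargs = build_model_kwargs(
    "transformers.AutoModel.from_pretrained", {"trust_remote_code": True}
)
```

A vanilla transformers target never sees the unknown keyword, so its `__init__` cannot raise a TypeError because of it.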

Signed-off-by: Kashif Rasul <kashif@huggingface.co>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
svcnvidia-nemo-ci added the waiting-on-customer label (Waiting on the original author to respond) May 26, 2026
@pthombre
Contributor

/ok to test 0f14f71

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
kashif added 12 commits May 27, 2026 15:34
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
…once

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Draw anchor blocks independently per sample ([B, N]) instead of sharing
one set across the batch, build a genuine per-sample block_keep_mask, and
drop samples with fewer than 2*block_size supervised tokens.
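
A minimal sketch of per-sample draws with the `2*block_size` cutoff, assuming anchors are block start indices and that a dropped sample simply gets no anchors. `draw_per_sample` and the boolean `keep` list are hypothetical; the PR's `block_keep_mask` may have a different shape and meaning.

```python
import random


def draw_per_sample(loss_masks, num_blocks, block_size, seed=0):
    """Return (anchors, keep): one independent anchor draw per sample."""
    rng = random.Random(seed)
    anchors, keep = [], []
    for mask in loss_masks:
        slack = len(mask) - num_blocks * block_size
        # Drop samples with too few supervised tokens (or too short overall).
        usable = sum(mask) >= 2 * block_size and slack >= 0
        keep.append(usable)
        if not usable:
            anchors.append([])
            continue
        # Independent stars-and-bars draw for this sample only.
        picks = sorted(rng.sample(range(slack + num_blocks), num_blocks))
        anchors.append([c - i + i * block_size for i, c in enumerate(picks)])
    return anchors, keep


loss_masks = [
    [1] * 16,            # fully supervised
    [0] * 12 + [1] * 4,  # only 4 supervised tokens, below 2 * block_size
    [1] * 16,
]
anchors, keep = draw_per_sample(loss_masks, num_blocks=2, block_size=4)
```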

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Add a DP/CP-correct draft top-1 accuracy metric (per-rank correct/global
tokens, SUM-allreduced) computed for free in the chunked linear-CE loop,
and prune isinstance/shape-only/constant tests from the dLLM suite.
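
The "per-rank correct / global tokens, SUM-allreduced" recipe can be checked with a single-process simulation. Here `all_reduce_sum` is just Python's `sum` standing in for a distributed SUM all-reduce, and the per-rank counts are invented for illustration.

```python
def all_reduce_sum(values):
    """Single-process stand-in for a SUM all-reduce across ranks."""
    return sum(values)


# Invented per-rank counts of correct top-1 predictions and scored tokens.
ranks = [
    {"correct": 30, "tokens": 40},
    {"correct": 10, "tokens": 60},
]

global_tokens = all_reduce_sum(r["tokens"] for r in ranks)

# Each rank contributes correct / global_tokens; summing gives global accuracy.
accuracy = all_reduce_sum(r["correct"] / global_tokens for r in ranks)

# Averaging per-rank accuracies instead over-weights the rank with fewer tokens.
naive = sum(r["correct"] / r["tokens"] for r in ranks) / len(ranks)
```

Dividing by the global token count before the reduction makes the summed value equal to total correct over total tokens, independent of how tokens are sharded.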

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Labels

community-request, waiting-on-customer

4 participants