feat(dllm): add DFlash and LLaDA2 SFT recipes #2315
Open
kashif wants to merge 50 commits into
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
- Split dflash YAML section: add required dllm: section alongside dflash: (DiffusionLMSFTRecipe.setup() reads dllm:, DFlashSFTRecipe reads dflash:)
- Auto-resolve mask_token_id by adding <|MASK|> special token to tokenizer when neither YAML nor base tokenizer (Qwen3) defines one
- Smoke test: use allenai/tulu-3-sft-mixture[:64] (ChatDataset requires OpenAI-format messages; wikitext plain text didn't qualify)
- Add standalone GPU test script (no torchao dependency)
Smoke run: 2 steps, loss 20.4→28.3, grad_norm flowing, mem 12.65 GiB ✓
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
- main(): remove type hints from signature, rename recipe → trainer
- _run_train_optim_step(): remove return type and batches type hints
- _run_validation_epoch(): remove return type hint
- log_train_metrics(): remove return type hint; switch % → .format(); add tps_per_gpu and mode to log line to match parent format
- MetricsSample: add Train/mfu (None), Train/supervised_tokens, change hardcoded "dflash" → self.dllm_mode; allreduce num_predicted_tokens
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Move all DFlash-specific training logic into DFlashStrategy so that DiffusionLMSFTRecipe handles all dLLM model families without subclassing.
Strategy additions (strategy.py):
- setup_extra(recipe): hook for loading auxiliary models; DFlashStrategy loads+freezes target LM and resolves mask_token_id for tokenizers that have none (e.g. Qwen3 → adds <|MASK|> special token)
- pre_step(recipe, batches) → (noise_tokens, supervised_tokens): MDLM does corruption loop; DFlash does target forwards + offloads to CPU
- forward_backward(recipe, idx, batch, ...): MDLM delegates to existing _forward_backward_step; DFlash implements anchor-block draft forward
- loss_log_key property: "Loss/Train_DLLM" / "Loss/Train_DFlash"
- _build_target_layer_ids, _sample_anchor_block, _run_target_forward moved from DFlashSFTRecipe into DFlashStrategy
Recipe changes (train_ft.py):
- setup(): defer mask_token_id raise to after setup_extra() call
- _run_train_optim_step(): replace inline corruption loop with strategy.pre_step(); dispatch via strategy.forward_backward()
- _run_validation_epoch(): same pre_step + forward_backward dispatch
- log_train_metrics(): use strategy.loss_log_key instead of hardcoded key
train_dflash.py reduced to a 6-line entry-point shim pointing at DiffusionLMSFTRecipe; DFlashSFTRecipe subclass removed entirely.
Smoke tested: DFlash 2-step run passes (loss 20.4→28.3, grads flowing).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
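The hook layout described in this commit can be pictured with a small, framework-free sketch. This is not the code in strategy.py: the method bodies, the `ToyRecipe` class, and the returned numbers are placeholders invented for illustration. Only the hook names (`setup_extra`, `pre_step`, `forward_backward`, `loss_log_key`), the registry name, and the two log keys come from the commit messages, and the "hybrid" entry is left out for brevity.

```python
from abc import ABC, abstractmethod


class DLLMStrategy(ABC):
    """Illustrative base class; only the hook names come from the commit."""

    def setup_extra(self, recipe):
        """Hook for loading auxiliary models. Default: nothing to load."""

    @abstractmethod
    def pre_step(self, recipe, batches):
        """Prepare batches; return (noise_tokens, supervised_tokens)."""

    @abstractmethod
    def forward_backward(self, recipe, idx, batch, **kwargs):
        """Process micro-batch number idx and return its (toy) loss."""

    @property
    @abstractmethod
    def loss_log_key(self):
        """Metric name used when logging the training loss."""


class MDLMStrategy(DLLMStrategy):
    def pre_step(self, recipe, batches):
        # Placeholder for the corruption loop: just count supervised tokens.
        supervised = sum(sum(b["loss_mask"]) for b in batches)
        return supervised, supervised

    def forward_backward(self, recipe, idx, batch, **kwargs):
        return 1.0  # placeholder loss

    @property
    def loss_log_key(self):
        return "Loss/Train_DLLM"


class DFlashStrategy(DLLMStrategy):
    def setup_extra(self, recipe):
        # Placeholder for loading and freezing the target LM.
        recipe.target_loaded = True

    def pre_step(self, recipe, batches):
        # Placeholder for the target forwards.
        supervised = sum(sum(b["loss_mask"]) for b in batches)
        return supervised, supervised

    def forward_backward(self, recipe, idx, batch, **kwargs):
        return 2.0  # placeholder loss

    @property
    def loss_log_key(self):
        return "Loss/Train_DFlash"


DLLM_STRATEGIES = {"mdlm": MDLMStrategy, "dflash": DFlashStrategy}


class ToyRecipe:
    """Hypothetical stand-in for the recipe; dispatches through the strategy."""

    def __init__(self, mode):
        self.target_loaded = False
        self.strategy = DLLM_STRATEGIES[mode]()
        self.strategy.setup_extra(self)

    def train_step(self, batches):
        noise_tokens, supervised_tokens = self.strategy.pre_step(self, batches)
        losses = [
            self.strategy.forward_backward(self, i, b, num_tokens=noise_tokens)
            for i, b in enumerate(batches)
        ]
        return {
            self.strategy.loss_log_key: sum(losses) / len(losses),
            "supervised_tokens": supervised_tokens,
        }
```

The point of the design is that the recipe never branches on the mode: adding a model family means adding a strategy class and a registry entry, not a recipe subclass.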
- Remove nemo_automodel/recipes/dllm/train_dflash.py; use the existing
examples/dllm_sft/finetune.py as the single entry point for all dLLM
SFT modes including DFlash (pthombre)
- Update dflash_sft.yaml: switch recipe key to DiffusionLMSFTRecipe,
update usage comment to point at finetune.py
- Pass num_tokens=num_diffusion_tokens to DFlashDecayLoss in
DFlashStrategy.forward_backward so the loss is properly normalised
across DP replicas and gradient-accumulation steps (claude review)
- Fix test_dflash_sft_gpu.py: import and use the real DFlashDecayLoss
instead of an inlined variant with different normalization; pass
num_tokens so the test exercises the same code path as production;
fix docstring ("three" -> "four" novel pieces) (claude review)
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
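Why a shared token count matters for normalisation can be checked with plain arithmetic. The sketch below is not `DFlashDecayLoss`; it assumes each micro-batch contributes a summed per-token loss and that `num_tokens` is the token count over all DP replicas and gradient-accumulation steps. The loss values are made up.

```python
def normalized_loss(per_token_losses, num_tokens):
    """Sum of one micro-batch's token losses divided by the global token count."""
    return sum(per_token_losses) / num_tokens


# Hypothetical per-token losses for three micro-batches of unequal length.
micro_batches = [[2.0, 2.0, 2.0, 2.0], [4.0], [1.0, 3.0]]
num_tokens = sum(len(mb) for mb in micro_batches)

# Accumulating globally normalised terms reproduces the global per-token mean.
accumulated = sum(normalized_loss(mb, num_tokens) for mb in micro_batches)
global_mean = sum(sum(mb) for mb in micro_batches) / num_tokens

# Averaging per-micro-batch means instead over-weights the short micro-batch.
mean_of_means = sum(sum(mb) / len(mb) for mb in micro_batches) / len(micro_batches)
```

With these numbers the accumulated value matches the global mean (16/7), while the mean of per-micro-batch means comes out higher (8/3) because the single-token micro-batch counts as much as the four-token one.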
- Merge DFlashDecayLoss into dllm_loss.py; delete dflash_loss.py
- DFlashStrategy: add _sample_anchor_blocks (stars-and-bars sampling of N non-overlapping anchors) and _build_block_attention_mask (sparse 4D mask where block b attends only to its own causal prefix and own noise positions)
- DFlashDecayLoss: add block_size param so per-block decay resets at each block boundary for N>1 training
- pre_step dispatches to multi-block path when num_blocks_per_sample > 1; position_ids use actual sequence positions for correct RoPE
- dflash_sft.yaml: num_blocks_per_sample: 1 (explicit default)
- dflash_smoke.yaml: num_blocks_per_sample: 4; fix recipe name
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
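The three ideas in this commit (stars-and-bars anchor sampling, a block-local attention mask, and decay that resets per block) can be sketched in pure Python. None of this is the PR's implementation. The key layout (context keys first, then one group of noise keys per block), the choice of "prefix" as positions strictly before the block start, and the geometric decay factor are all assumptions made for illustration.

```python
import random


def sample_anchor_blocks(seq_len, block_size, num_blocks, rng):
    """Stars-and-bars draw of non-overlapping block starts, sorted ascending."""
    free = seq_len - num_blocks * block_size  # slots not covered by any block
    if free < 0:
        raise ValueError("sequence too short for the requested blocks")
    # Choose where the blocks sit among the free slots, then expand each
    # chosen slot back to a block of block_size positions.
    picks = sorted(rng.sample(range(free + num_blocks), num_blocks))
    return [c - i + i * block_size for i, c in enumerate(picks)]


def build_block_mask(seq_len, block_size, starts):
    """Boolean mask [num_queries][num_keys]; True means the query may attend.

    Assumed layout: keys are seq_len context positions followed by
    block_size noise positions per block; queries are the noise positions.
    """
    n = len(starts) * block_size
    mask = [[False] * (seq_len + n) for _ in range(n)]
    for b, start in enumerate(starts):
        for q in range(b * block_size, (b + 1) * block_size):
            for k in range(start):  # context before this block only
                mask[q][k] = True
            for k in range(block_size):  # this block's own noise keys
                mask[q][seq_len + b * block_size + k] = True
    return mask


def block_decay_weights(num_blocks, block_size, decay):
    """Per-position weights that restart at 1.0 at each block boundary."""
    return [decay ** (k % block_size) for k in range(num_blocks * block_size)]
```

A call such as `sample_anchor_blocks(32, 4, 3, random.Random(0))` returns three starts at least four positions apart, so the blocks never overlap and every valid placement is equally likely.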
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
…model/mask token
- AND block_mask with loss_mask in _sample_anchor_block and _sample_anchor_blocks so prompt tokens are not supervised during DFlash SFT
- pre_step now reads loss_mask from the batch and passes it through
- Fix llada2_sft.yaml: correct model to inclusionAI/LLaDA2.1-mini and update mask_token_id from 126336 to 156895 (<|mask|> in Qwen tokenizer)
- Update docs: note different mask_token_id values for LLaDA vs LLaDA2.1
- Add test_loss_mask_zeros_block_mask to cover the new loss_mask path
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
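The effect of ANDing the two masks is easy to see on a single toy sample. The helper name and the 8-token example below are hypothetical; the sketch only assumes that both masks are per-position 0/1 flags of equal length.

```python
def supervised_block_mask(block_mask, loss_mask):
    """Keep a position only if it is inside an anchor block AND supervised."""
    return [bool(b) and bool(s) for b, s in zip(block_mask, loss_mask)]


# Hypothetical sample: the first 5 tokens are prompt, the last 3 are response.
loss_mask = [0, 0, 0, 0, 0, 1, 1, 1]
# Anchor block covering positions 3..6, straddling the prompt boundary.
block_mask = [0, 0, 0, 1, 1, 1, 1, 0]

combined = supervised_block_mask(block_mask, loss_mask)
```

Positions 3 and 4 fall inside the block but belong to the prompt, so they drop out and only positions 5 and 6 remain supervised.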
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
…draft init script
- Use `dtype=` instead of deprecated `torch_dtype=` in AutoModelForCausalLM.from_pretrained
- Add `trust_remote_code=True` to support Nemotron-H hybrid SSM target models
- Add scripts/create_nemotron_nano30b_dflash_draft.py to initialise a 7-layer DFlash draft (hidden=2688, 21 Q-heads, 3 KV-heads) for NVIDIA-Nemotron-3-Nano-30B-A3B
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
7-layer draft (hidden=2688, 21 Q-heads, 3 KV-heads, head_dim=128) conditioned on frozen NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 target, trained on Nemotron-Post-Training-Dataset-v2 chat+math+code mix for 3 epochs.
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
…gside DFlash
- Add HybridDiffusionLLMLoss and HybridStrategy (Nemotron-Labs-Diffusion) from main
- Import corrupt_blockwise alongside corrupt_uniform in strategy.py
- Register "hybrid" in DLLM_STRATEGIES alongside "mdlm" and "dflash"
- Update docs and tests to cover all three strategies
Signed-off-by: Kashif Rasul <kashif@huggingface.co>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
train_ft.py now passes num_ar_tokens= to forward_backward for all strategies (added in the Nemotron-Labs-Diffusion hybrid support). DFlashStrategy and the DLLMStrategy base class both need it in their signatures to avoid a TypeError. DFlash itself does not use AR tokens but accepts the kwarg for API compatibility.
Signed-off-by: Kashif Rasul <kashif@huggingface.co>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
The Nemotron-Labs-Diffusion hybrid commit sets _restore_loaded_dtype=False on the model config to prevent NeMo's loader from silently downcasting master weights. However, when the model target is a vanilla transformers class (e.g. DFlash using AutoModel.from_pretrained with trust_remote_code), this internal flag gets passed to the model __init__ and raises TypeError. Guard the flag to NeMo model targets only.
Signed-off-by: Kashif Rasul <kashif@huggingface.co>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
/ok to test 0f14f71
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
…odel into add-dllm-pipelines
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
…once
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Draw anchor blocks independently per sample ([B, N]) instead of sharing one set across the batch, build a genuine per-sample block_keep_mask, and drop samples with fewer than 2*block_size supervised tokens.
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
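A minimal sketch of per-sample drawing with the 2*block_size filter follows. It is an illustration under assumptions, not the PR's code: `draw_batch_anchors` and `sample_keep` are hypothetical names (`sample_keep` is a per-sample drop flag, not the PR's `block_keep_mask`), dropped samples are represented by `None`, and the sketch does not model where anchors land relative to the supervised region.

```python
import random


def draw_batch_anchors(loss_masks, block_size, num_blocks, rng):
    """Draw anchor starts per sample; flag samples with too little supervision."""
    anchors, sample_keep = [], []
    for mask in loss_masks:
        free = len(mask) - num_blocks * block_size
        # Drop rule from the commit: fewer than 2*block_size supervised tokens.
        keep = sum(mask) >= 2 * block_size and free >= 0
        sample_keep.append(keep)
        if not keep:
            anchors.append(None)  # dropped sample: no anchors drawn
            continue
        # Fresh draw for this sample, so rows of the [B, N] result differ.
        picks = sorted(rng.sample(range(free + num_blocks), num_blocks))
        anchors.append([c - i + i * block_size for i, c in enumerate(picks)])
    return anchors, sample_keep
```

For a batch whose middle sample has only three supervised tokens and `block_size=4`, that sample is flagged as dropped while the other two each receive their own pair of non-overlapping starts.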
Add a DP/CP-correct draft top-1 accuracy metric (per-rank correct/global tokens, SUM-allreduced) computed for free in the chunked linear-CE loop, and prune isinstance/shape-only/constant tests from the dLLM suite.
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
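The "per-rank correct / global tokens, then SUM all-reduce" recipe can be simulated without any distributed runtime. In this sketch the ranks are plain Python lists and `sum_allreduce` is a stand-in for a real SUM all-reduce; the predictions and targets are invented.

```python
def rank_accuracy_term(predictions, targets, global_tokens):
    """One rank's contribution: its correct count over the GLOBAL token count."""
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return correct / global_tokens


def sum_allreduce(values):
    """Stand-in for a SUM all-reduce across ranks (no torch.distributed here)."""
    return sum(values)


# Two hypothetical ranks holding different numbers of tokens.
ranks = [
    ([1, 2, 3, 4], [1, 2, 0, 4]),  # 3 of 4 correct
    ([5, 6], [5, 0]),              # 1 of 2 correct
]
global_tokens = sum(len(targets) for _, targets in ranks)

accuracy = sum_allreduce(
    rank_accuracy_term(p, t, global_tokens) for p, t in ranks
)

# Averaging per-rank accuracies instead would weight both ranks equally.
naive = sum(
    sum(int(a == b) for a, b in zip(p, t)) / len(t) for p, t in ranks
) / len(ranks)
```

Because every rank divides by the same global denominator, the summed result equals total correct over total tokens (4/6 here), whereas the naive average of per-rank accuracies gives 0.625.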
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
What does this PR do?
Adds DFlash speculative decoding and LLaDA2 SFT recipes for diffusion language models.
Changelog
- `DFlashStrategy` for DFlash speculative decoding training (draft model trains against frozen target LM hidden states with decay-weighted loss)
- `HybridStrategy` + `HybridDiffusionLLMLoss` (from upstream main, co-existing with DFlash)
- `DFlashDecayLoss` with position-decay cross-entropy and `DLLMLossOutput`
- …`examples/dllm_sft/nemotron_nano30b_dflash.yaml`)
- `scripts/create_nemotron_nano30b_dflash_draft.py` for initializing 7-layer DFlash draft checkpoints
- `DLLM_STRATEGIES` registry: `{"mdlm": MDLMStrategy, "hybrid": HybridStrategy, "dflash": DFlashStrategy}`
- Guard `_restore_loaded_dtype` to only apply for NeMo model loading paths (not vanilla `trust_remote_code` models)
Before your PR is "Ready for review"
Pre checks:
Additional Information
Replaces #2214 (closed due to accidental force-push that rewrote shared history).
Smoke test on 8×H100 confirmed: step 0 loss 5.3540, DFlash setup verified with `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` as target.