feat: adding PP and CP for nemotron v3 models#2316
Draft
adil-a wants to merge 17 commits into
Draft
Conversation
… fix Source-only snapshot for adil/adil-test (YAMLs intentionally excluded). - NemotronV3 pipeline-parallel + MTP wiring (model.py, mtp.py, layers.py, state_dict_adapter.py, common/mtp/mtp.py). - Loss path: PipelineCausalLMLoss, calculate_mtp_loss with cu_seqlens / seq_idx threading, MoEAuxLossAutoScaler-aware grad scaling in train_ft. - Attention preprocess threading for THD + CP (attention/utils.py). - MoE parallelizer: iterate transformer and MTP blocks together for CP hook attach and FSDP wrapping. THD collator fix: - process_input_for_thd: gate trailing-pack-pad absorption on trailing_pad <= max_seqlen; for larger pads, extend cu_seqlens with dummy max_seqlen-sized slots and regenerate position_ids per-slot. - split_batch_into_thd_chunks: emit cu_seqlens_padded whenever any chunk needs it (any-vs-all), absorbed chunks fall back to cu_seqlens. - NemotronV3Model.forward and NemotronHForCausalLM MTP-kwargs: strip -1000 sentinel right-pad from cu_seqlens / cu_seqlens_padded before they reach TE attention / mamba searchsorted. Tests: - tests/unit_tests/distributed/test_thd_utils.py: dummy-slot extension coverage, mixed-chunk short+full coverage; updated existing tests to reflect cu_seqlens emit semantics (real cumsum, not padded). - tests/unit_tests/models/nemotron_v3/test_nemotron_v3_mtp.py + test_nemotron_v3_pp_mtp.py: list-form mtp_layers_block_type resolution, PP-stage metas, MTP placement on last stage, sentinel filter on trimmed mid-stages. Dataset: - nemo_automodel/components/datasets/llm/pre_rendered_chat_dataset.py: new dataset for pre-rendered chat data. Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Not used by the THD/PP/MTP changes on this branch; keep it out of the remote until it ships with its own dedicated PR. Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…contract The THD collator's trailing-pad absorption grows ``cu_seqlens[-1]`` to ``total_tokens`` so TE can skip the ``pad_between_seqs=True`` path. But ``max_seqlen`` was computed BEFORE the absorption from the unpadded ``seq_lens`` — so we handed TE a ``cu_seqlens`` containing a slot wider than the ``max_seqlen`` value we claimed. That is explicit UB per TE's documented contract (``fused_attn.py:152-159``, ``fused_attn.h:548-551``: ``max_seqlen_q ... may be >= max(seqlen_q_i)``; smaller-than-actual is undefined). cuDNN-fused-attn-bwd happened not to crash on the small-overshoot case (128 vs 112) but did crash on the larger captured "short microbatch" (576 vs 112). Fix: defer the ``max_seqlen`` computation to after the absorption block and source it from the FINAL ``cu_seqlens`` slot widths. With the value now truthful, absorption is contract-clean for any trailing-pad size, so the previous defensive dummy-slot extension is no longer needed. Verified with a single-GPU TE probe (te_thd_repro_MINIMAL.py): feeding ``cu_seqlens=[0,112,224,336,448,1024]`` with truthful ``max_seqlen=576`` runs cleanly; the same layout with the lied ``max_seqlen=112`` still crashes. Smoke tests pass at PP=4 EP=2 CP=1 and PP=4 EP=2 CP=2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
``cu_seqlens`` and ``cu_seqlens_padded`` are constructed inside nested ``if seq_lens is not None``/``if seq_lens_padded is not None`` blocks, so ``cu_seqlens_padded`` can only be non-None when ``cu_seqlens`` is also non-None — the ``else cu_seqlens_padded`` fallback is unreachable. Falling back to the padded variant would also be semantically wrong (memory offsets vs. compute extent, per TE's docs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…tics After the trailing-pad absorption + max_seqlen fix, ``cu_seqlens`` is emitted from ``seq_lens`` (real lengths), not ``seq_lens_padded``. The docstrings still said the opposite and the example output value was wrong. Also document the optional ``cu_seqlens_padded`` and ``max_seqlen`` keys that the function emits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Keep the comments that explain WHY (TE contract, the absorption gate, the pad_between_seqs flip on cu_seqlens_padded presence). Drop the ones that just restate WHAT the surrounding code already shows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
The cu_seqlens → seq_idx derivation used the default ``side="left"`` for
``torch.searchsorted``. By that contract a position equal to a boundary
(``t == cu_seqlens[k]`` — the FIRST token of sub-seq k) gets classified
as sub-seq k-1 instead of k, because cu_seqlens[1:][k-1] = cu_seqlens[k]
is ≥ t and side="left" picks the leftmost insertion point. That is an
off-by-one at every internal sub-seq boundary.
Two production sites were affected:
* ``loss/mtp.py:calculate_mtp_loss`` — derives seq_idx from cu_seqlens
to mask MTP labels whose rolled source crosses a sub-seq boundary.
Without right=True, the boundary token's mask is wrong.
* ``models/nemotron_v3/layers.py:NemotronV3Mamba2Mixer.forward`` —
derives seq_idx for mamba's SSD scan state-reset. Without right=True,
the state reset happens one token late at every boundary, so the
first token of a new sub-seq sees the previous sub-seq's accumulated
state.
Fix is a one-line change at each site: add ``right=True`` to
``torch.searchsorted``.
Tests:
* tests/unit_tests/loss/test_mtp_cross_boundary.py — 7 tests pinning
the MTP cross-boundary mask behavior against a hand-coded reference.
* tests/unit_tests/distributed/test_seq_idx_from_cu_seqlens.py — 34
tests covering deterministic edge cases, randomized exhaustive
verification against a brute-force linear scan, production-site
pinning, and a static-analysis guard so reverts to side="left" fail.
Smoke verification: 8-GPU PP=4 EP=2 CP=1 hellaswag smoke produces a
loss trajectory bit-identical to pre-fix (hellaswag's pad-to-max-length
data has long pad runs between sub-seqs that absorb the bug's effect
on this dataset; the fix matters more for tightly-packed data).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Drops the env-var-gated instrumentation added during the
nemotron_v3 PP+MTP + THD-collator bug hunt:
* NEMO_PP_PACK_DEBUG — per-microbatch packing/PP info prints from
train_ft.py, nemotron_v3/{model,layers}.py, and common/mtp/mtp.py.
* NEMO_CU_DEBUG — cu_seqlens print inside the model forward.
* NEMO_DUMP_BATCH_DIR — per-step per-rank batch tensor dump for
offline replay.
* NEMO_REPLAY_BATCH_DIR / NEMO_REPLAY_BATCH_STEP /
NEMO_REPLAY_OVERRIDE_STEP / NEMO_REPLAY_OVERRIDE_WHAT — offline
batch replay + mixed-field override hooks.
* NEMO_COPY_MB0_TO_ALL — duplicate microbatch 0 across all
microbatch slots.
Also cleans up the few helper variables (``_seq_idx_source``,
``cu_seqlens_source``) that only existed to populate those prints,
and removes the now-unused ``import os`` from layers.py and model.py.
Net: -291 LOC across 4 files. 8-GPU PP=4 EP=2 CP=1 hellaswag smoke
is bit-identical to pre-cleanup (10/10 steps, val 8.9692).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…ition_ids
At MTP depth k, the slot at position t embeds the token originally at
position ``t + k + 1``. That is implemented in ``MTPModule.forward``
by cumulatively rolling ``cur_input_ids`` and ``cur_position_ids`` left
by one at each depth iteration. The label-side rolling on the loss path
is already covered by ``tests/unit_tests/loss/test_mtp_cross_boundary.py``;
this adds the matching coverage on the model side.
Six new tests using a recording sublayer to intercept what reaches the
inner block at each depth:
* test_cumulative_input_ids_rolling_1d / _position_ids_rolling_1d /
_rolling_2d_batch — depth k receives inputs rolled by -(k+1).
* test_multi_sublayer_per_depth_sees_same_rolled_inputs — when
pattern_length > 1, all sublayers of a depth share the same rolled
inputs (no further intra-depth rolling); embed_input is only on
sublayer 0.
* test_precomputed_embed_inputs_path_skips_token_rolling — when the
caller pre-computes embeddings (PP final-stage path), the embed
rolling is skipped but position_ids still rolls.
* test_trailing_positions_zero_after_roll — roll is left-shift with
zeros, not wrap-around; the last k+1 slots are 0 at depth k.
CPU-only, no GPU required, ~4s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Keep the WHY notes that explain non-obvious behavior (the ``right=True`` boundary semantics for searchsorted, the PP-chunking shape contract, and the cross-sub-seq mask rationale). Drop the WHAT-narrating comments — the surrounding code already says what it does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…forward Two comment blocks in ``NemotronV3Model.forward`` and the MTP-kwargs path claimed PyTorch PP could not chunk ``cu_seqlens`` and ``broadcasts kwargs unchunked``. That was true earlier when ``cu_seqlens`` flowed as 1D ``[num_seqs+1]``; after ``split_batch_into_thd_chunks``, ``cu_seqlens`` is emitted stacked as ``[num_microbatches, max_K]`` and PP ``tensor_split``s along dim 0, yielding ``[1, max_K]`` per microbatch. The normalization code (squeeze + sentinel strip + max_seqlen flatten) is unchanged — the shapes it handles are exactly what PP produces today; only the comments were misdescribing the mechanism. Also reframes the cu_seqlens rebuild-from-attention_mask block as the non-THD-collater path (which is the real reason it exists) rather than the now-incorrect "PP can't chunk" rationale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Keep WHY-explaining notes (PP tuple arity contract, sentinel-strip rationale, FSDP2 lazy-init constraint, CP context forwarding rationale, THD-vs-BSHD backend gating). Drop WHAT-narrating prose that just restates what the next few lines do. Net -106 LOC in model.py. Source behavior unchanged; verified by ``pytest tests/unit_tests/models/nemotron_v3/`` (3 pre-existing ``seq_idx``-tail arity test failures unchanged from prior runs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Third instance of the cu_seqlens → seq_idx off-by-one. The PP last-stage seq_idx tail (used by ``PipelineCausalLMLoss`` for the MTP cross-boundary mask) derived per-token sub-seq ids via ``searchsorted`` with the default ``side="left"``, which mis-classifies every internal boundary token (``position == cu_seqlens[k]``) as belonging to sub-seq ``k-1`` instead of ``k``. The loss function's own ``right=True`` fix only protects its cu_seqlens-derivation fallback (mtp.py:87); when an explicit ``seq_idx`` is supplied (the PP path), the loss uses it verbatim. So this builder was the active source of off-by-one masks at sub-seq boundaries for THD+MTP+PP training. Matches the prior fixes in ``loss/mtp.py`` and ``models/nemotron_v3/layers.py``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…alLM.forward The trim pass dropped the per-parameter Args section that's on main. Restore it (with the new ``*mtp_embed_inputs`` positional plus the PP-aware tuple Returns), keep the existing high-level summary and PP-awareness paragraph from the recent trim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…ocstring
Two small corrections:
* The summary paragraph claimed the last-PP-stage return was
``(logits, *mtp_per_depth_h)``, but with MTP enabled the actual return
appends a ``seq_idx`` tail (``1 + D + 1`` arity), as the Returns block
just below already states. Defer to the Returns section instead of
duplicating (and contradicting) it.
* The THD-squeeze sentence over-claimed: the squeeze only happens when
the attention backend is TE; SDPA / flex paths stay in BSHD even with
``qkv_format='thd'``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Drop the multi-line preamble explaining why we iterate both backbone and MTP attention blocks — ``_iter_transformer_and_mtp_blocks`` is named self-descriptively. The non-obvious caveat (uninitialized CP state → illegal-memory-access) belongs in the helper's docstring, not at every call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.