feat: make mesh accept meshcontext #2266

Open
adil-a wants to merge 14 commits into main from akoumpa/refactor_auto_class_public_api

@adil-a
Collaborator

@adil-a adil-a commented May 18, 2026

What does this PR do?

Refactors the distributed public API around MeshContext so that users can initialize the distributed environment once, create a mesh context, and pass it directly to NeMoAutoModelForCausalLM.from_pretrained.

Changelog

  • Add/standardize create_mesh_context as the component-layer API that returns a MeshContext.
  • Rename the recipe YAML adapter from _dist_setup.setup_distributed to _dist_utils.create_mesh_context_from_config.
  • Update NeMoAutoModel*, recipes, diffusion pipeline, docs, and tests to use mesh-context naming.
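
The layering described above (a component-layer factory that returns a context object, plus a recipe-layer adapter that builds it from YAML-style config) can be sketched roughly as follows. This is a minimal illustration, not the NeMo AutoModel implementation: the bodies are toy stand-ins that only borrow the names MeshContext, create_mesh_context, and create_mesh_context_from_config from this PR. The field names (dp_size, tp_size, cp_size, pp_size), the world_size argument, the config keys, and the load_model helper with its mesh keyword are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class MeshContext:
    """Toy stand-in (assumed fields): parallelism sizes describing a device mesh."""

    dp_size: int = 1
    tp_size: int = 1
    cp_size: int = 1
    pp_size: int = 1

    @property
    def world_size(self) -> int:
        # Total ranks covered by the mesh.
        return self.dp_size * self.tp_size * self.cp_size * self.pp_size


def create_mesh_context(
    *, world_size: int, tp_size: int = 1, cp_size: int = 1, pp_size: int = 1
) -> MeshContext:
    """Component-layer factory (assumed signature): infer dp_size from the rest."""
    model_parallel = tp_size * cp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*cp*pp={model_parallel}"
        )
    return MeshContext(
        dp_size=world_size // model_parallel,
        tp_size=tp_size,
        cp_size=cp_size,
        pp_size=pp_size,
    )


def create_mesh_context_from_config(
    cfg: Mapping[str, Any], *, world_size: int
) -> MeshContext:
    """Recipe-layer adapter (assumed keys): map a YAML-style dict to the factory."""
    return create_mesh_context(
        world_size=world_size,
        tp_size=int(cfg.get("tp_size", 1)),
        cp_size=int(cfg.get("cp_size", 1)),
        pp_size=int(cfg.get("pp_size", 1)),
    )


def load_model(name: str, *, mesh: MeshContext) -> dict:
    """Hypothetical loader that accepts the context directly via `mesh`."""
    return {"name": name, "mesh": mesh}


# Build the context once, then hand the same object to the loader.
ctx = create_mesh_context_from_config({"tp_size": 2, "cp_size": 2}, world_size=8)
model = load_model("example-model", mesh=ctx)
```

The point of the split is that library users call the factory directly, while recipes keep a thin adapter that only translates config into keyword arguments.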

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

Validation:

  • uv run ruff format ...
  • uv run ruff check --fix ...
  • uv run pytest tests/unit_tests/distributed/test_mesh_utils.py tests/unit_tests/distributed/test_device_mesh.py tests/unit_tests/recipes/test_dist_utils.py tests/unit_tests/recipes/test_diffusion_train_metrics.py tests/unit_tests/_diffusers/test_auto_diffusion_pipeline.py -q
  • Targeted recipe setup tests passed.

Note: running the full test_train_ft.py and test_finetune_vlm_helpers.py files hit failures in the optional cut_cross_entropy CUDA fused-CE path in this environment; these failures are unrelated to this PR.

Additional Information

Related to distributed public API cleanup.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@adil-a
Collaborator Author

adil-a commented May 18, 2026

/ok to test 3dcadfb

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa
Contributor

akoumpa commented May 18, 2026

/ok to test a8b2df6

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa
Contributor

akoumpa commented May 19, 2026

/ok to test a4876ae

@akoumpa
Contributor

akoumpa commented May 20, 2026

/ok to test d836169

Contributor

@jgerh jgerh left a comment

Completed tech pubs review of docs/guides/gradient-checkpointing.md. No changes needed. LGTM.

@akoumpa
Contributor

akoumpa commented May 26, 2026

/ok to test 010ddc8

Two leftover references to the old setup_distributed/dist_setup API were
missed when the recipe was migrated to create_mesh_context_from_config:

- nemo_automodel/recipes/vlm/finetune.py:794 still read
  self.dist_setup.cp_size, which would raise AttributeError on any PP+CP VLM run.
- tests/unit_tests/recipes/test_finetune_vlm_cp_wiring.py monkeypatched
  the stale symbol "setup_distributed", causing three parametrizations of
  test_setup_skips_pp_media_prechunk_when_cp_preembeds_vlm_inputs to fail
  during pytest setup with AttributeError.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
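
The test failure described in this commit message follows from how patching helpers treat missing attributes: pytest's monkeypatch.setattr raises AttributeError by default when the target does not exist, and the standard library's unittest.mock.patch.object behaves the same way. The sketch below reproduces that behavior with patch.object on a toy module; the `recipe` module and the `try_patch` helper are hypothetical and only reuse the old and new symbol names from the commit message.

```python
import types
from unittest import mock

# Toy module standing in for a recipe module after the rename (hypothetical).
recipe = types.ModuleType("recipe")
recipe.create_mesh_context_from_config = lambda cfg: {"cp_size": cfg.get("cp_size", 1)}


def try_patch(symbol: str) -> str:
    """Try to patch `symbol` on the toy module and report what happened."""
    try:
        # patch.object refuses to patch an attribute that does not exist
        # (unless create=True is passed), so a stale name fails on entry.
        with mock.patch.object(recipe, symbol, return_value={"cp_size": 2}):
            return "patched"
    except AttributeError:
        return "AttributeError"


stale_result = try_patch("setup_distributed")  # old name, no longer defined
current_result = try_patch("create_mesh_context_from_config")  # new name
```

Because the error is raised while the patch is being applied, the test fails during setup rather than in its assertions, which matches the failure mode reported above.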
@akoumpa
Contributor

akoumpa commented May 26, 2026

/ok to test e85de37

akoumpa added 4 commits May 26, 2026 15:17
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa
Contributor

akoumpa commented May 27, 2026

/ok to test fd99484

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa
Contributor

akoumpa commented May 27, 2026

/ok to test 20415cf


3 participants