
[mxfp8 training] unified MXFP8TrainingConfig and MXFP8TrainingTensor#3948

Closed
danielvegamyhre wants to merge 1 commit into main from danielvegamyhre/stack/145

Conversation


@danielvegamyhre danielvegamyhre commented Feb 25, 2026

Stacked PRs:


[mxfp8 training] unified MXFP8TrainingConfig and MXFP8TrainingTensor

Config changes

  • Add unified MXFP8TrainingConfig for linear and grouped_mm. This replaces MXFP8GroupedMMConfig.
    • Simplified set of options shared by both ops. The AUTO kernel preference handles the kernel dispatch options distinct to each op in an opinionated way, selecting the best kernel we have for that operation based on our benchmarks.
  • Rename MXFP8GroupedMMRecipe -> MXFP8TrainingRecipe.
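To make the shape of the unified config concrete, here is a minimal sketch of a config shared by the linear and grouped_mm paths with opinionated AUTO dispatch. The class and recipe names come from this PR, but every field name, kernel string, and the `resolve_kernel` helper are hypothetical illustrations, not the actual torchao API.

```python
from dataclasses import dataclass
from enum import Enum, auto


class KernelPreference(Enum):
    """Kernel dispatch preference; AUTO picks an op-specific default."""
    AUTO = auto()


@dataclass(frozen=True)
class MXFP8TrainingConfig:
    """Unified config consumed by both the linear and grouped_mm ops.

    Field names here are illustrative, not the real torchao signature.
    """
    block_size: int = 32  # MX spec: one shared scale per 32 elements
    kernel_preference: KernelPreference = KernelPreference.AUTO


def resolve_kernel(config: MXFP8TrainingConfig, op: str) -> str:
    """Sketch of opinionated AUTO dispatch: one best kernel per op,
    standing in for the benchmark-driven choices described above."""
    if config.kernel_preference is KernelPreference.AUTO:
        # Hypothetical kernel names, chosen per op rather than per config.
        return {"linear": "cublas_mxfp8", "grouped_mm": "cutlass_mxfp8_grouped"}[op]
    raise ValueError(f"unhandled kernel preference: {config.kernel_preference}")
```

The point of the sketch is that callers no longer pick per-op kernel knobs; both ops consume one config, and the op identity alone determines the dispatch under AUTO.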

Tensor subclass changes

  • Rename ScaledGroupedMMTensor to MXFP8TrainingTensor

Autograd func changes

  • Add convenience wrapper _to_mxfp8_then_scaled_mm
  • Update mx_mm autograd func to support wgrad_with_hp recipe
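For context on what a `_to_mxfp8_then_scaled_mm`-style helper quantizes, here is a pure-Python sketch of MX-style per-block scaling: each 32-element block shares one power-of-two (E8M0-style) scale chosen so values fit the fp8 e4m3 range. This is an illustration of the MX format, not torchao's implementation; mantissa rounding to e4m3 is omitted, so the roundtrip is exact here. Under a recipe like wgrad_with_hp, the weight-gradient matmul would presumably skip this quantization and use the high-precision operands directly.

```python
import math

E4M3_MAX = 448.0   # largest finite value representable in fp8 e4m3
BLOCK_SIZE = 32    # MX block size: one shared scale per 32 elements


def quantize_block(block):
    """Compute a shared power-of-two scale for one block, then express
    the values relative to that scale (e4m3 rounding omitted)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Smallest power-of-two scale that keeps amax within the fp8 range.
    scale = 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))
    scaled = [x / scale for x in block]
    return scale, scaled


def dequantize_block(scale, scaled):
    return [q * scale for q in scaled]
```

Because the scale is a power of two, dividing and re-multiplying is exact in binary floating point, which is why MX formats reserve the full exponent-only E8M0 encoding for scales.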

Temporarily removed FP8GroupedMM quantize_ workflow support

  • In the future, we can refactor float8 training to follow a similar pattern if we wish. For now, given that FP8 MoE training is in a less mature state (~10% TPS increase for Llama4 Scout, and less interest in GitHub issues/PRs), I am simplifying this MXFP8 refactor by disabling the FP8 GroupedMM tests and workflow support (quantize_()).
  • If/when we get to FP8 blockwise training, this can be added back without much effort.

Tests

  • Added linear test cases in test/prototype/moe_training/test_training.py
  • ./test/prototype/moe_training/test_everything.sh

@pytorch-bot

pytorch-bot bot commented Feb 25, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3948

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Unrelated Failure

As of commit 936b309 with merge base 4ae435e:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 25, 2026
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/145 branch from 06f5ff4 to 7e69751 Compare February 25, 2026 06:29
@danielvegamyhre danielvegamyhre added mx module: training quantize_ api training flow moe labels Feb 25, 2026
@danielvegamyhre danielvegamyhre marked this pull request as draft February 25, 2026 16:38
@danielvegamyhre danielvegamyhre marked this pull request as ready for review February 25, 2026 16:39
stack-info: PR: #3948, branch: danielvegamyhre/stack/145
@danielvegamyhre danielvegamyhre marked this pull request as draft February 28, 2026 00:28
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/145 branch from 7e69751 to 936b309 Compare February 28, 2026 00:28
@danielvegamyhre danielvegamyhre marked this pull request as ready for review February 28, 2026 00:28