
[mxfp8 training] unified MXFP8TrainingConfig and MXFP8TrainingTensor#3948

Closed
danielvegamyhre wants to merge 1 commit into main from danielvegamyhre/stack/145

Conversation


@danielvegamyhre danielvegamyhre commented Feb 25, 2026

Stacked PRs:


[mxfp8 training] unified MXFP8TrainingConfig and MXFP8TrainingTensor

Config changes

  • Add unified MXFP8TrainingConfig for linear and grouped_mm. This replaces MXFP8GroupedMMConfig.
    • Simplified set of options shared by both ops. The AUTO kernel preference handles the kernel dispatch options distinct to each op in an opinionated way, selecting the best kernel we have for that operation based on our benchmarks.
  • Rename MXFP8GroupedMMRecipe -> MXFP8TrainingRecipe.
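To make the shape of the unified config concrete, here is a minimal sketch of a config shared by the linear and grouped_mm paths with opinionated AUTO dispatch. The class and recipe names come from this PR, but every field name, kernel string, and the `resolve_kernel` helper are hypothetical illustrations, not the actual torchao API.

```python
from dataclasses import dataclass
from enum import Enum, auto


class KernelPreference(Enum):
    """Kernel dispatch preference; AUTO picks an op-specific default."""
    AUTO = auto()


@dataclass(frozen=True)
class MXFP8TrainingConfig:
    """Unified config consumed by both the linear and grouped_mm ops.

    Field names here are illustrative, not the real torchao signature.
    """
    block_size: int = 32  # MX spec: one shared scale per 32 elements
    kernel_preference: KernelPreference = KernelPreference.AUTO


def resolve_kernel(config: MXFP8TrainingConfig, op: str) -> str:
    """Sketch of opinionated AUTO dispatch: one best kernel per op,
    standing in for the benchmark-driven choices described above."""
    if config.kernel_preference is KernelPreference.AUTO:
        # Hypothetical kernel names, chosen per op rather than per config.
        return {"linear": "cublas_mxfp8", "grouped_mm": "cutlass_mxfp8_grouped"}[op]
    raise ValueError(f"unhandled kernel preference: {config.kernel_preference}")
```

The point of the sketch is that callers no longer pick per-op kernel knobs; both ops consume one config, and the op identity alone determines the dispatch under AUTO.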

Tensor subclass changes

  • Rename ScaledGroupedMMTensor to MXFP8TrainingTensor

Autograd func changes

  • Add convenience wrapper _to_mxfp8_then_scaled_mm
  • Update mx_mm autograd func to support wgrad_with_hp recipe
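For context on what a `_to_mxfp8_then_scaled_mm`-style helper quantizes, here is a pure-Python sketch of MX-style per-block scaling: each 32-element block shares one power-of-two (E8M0-style) scale chosen so values fit the fp8 e4m3 range. This is an illustration of the MX format, not torchao's implementation; mantissa rounding to e4m3 is omitted, so the roundtrip is exact here. Under a recipe like wgrad_with_hp, the weight-gradient matmul would presumably skip this quantization and use the high-precision operands directly.

```python
import math

E4M3_MAX = 448.0   # largest finite value representable in fp8 e4m3
BLOCK_SIZE = 32    # MX block size: one shared scale per 32 elements


def quantize_block(block):
    """Compute a shared power-of-two scale for one block, then express
    the values relative to that scale (e4m3 rounding omitted)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Smallest power-of-two scale that keeps amax within the fp8 range.
    scale = 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))
    scaled = [x / scale for x in block]
    return scale, scaled


def dequantize_block(scale, scaled):
    return [q * scale for q in scaled]
```

Because the scale is a power of two, dividing and re-multiplying is exact in binary floating point, which is why MX formats reserve the full exponent-only E8M0 encoding for scales.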

Temporarily removed FP8GroupedMM quantize_ workflow support

  • In the future, we can refactor float8 training to follow a similar pattern if we wish. For now, given that FP8 MoE training is in a less mature state (~10% TPS increase for Llama4 Scout, and less interest in GitHub issues/PRs), I am simplifying this MXFP8 refactor by disabling the FP8 GroupedMM tests and workflow support (quantize_()).
  • If/when we get to FP8 blockwise training, this can be added back without much effort.

Tests

  • Added linear test cases in test/prototype/moe_training/test_training.py
  • ./test/prototype/moe_training/test_everything.sh

@pytorch-bot

pytorch-bot bot commented Feb 25, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3948

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Unrelated Failure

As of commit 936b309 with merge base 4ae435e:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 25, 2026
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/145 branch from 06f5ff4 to 7e69751 Compare February 25, 2026 06:29
@danielvegamyhre danielvegamyhre added mx module: training quantize_ api training flow moe labels Feb 25, 2026
@danielvegamyhre danielvegamyhre marked this pull request as draft February 25, 2026 16:38
@danielvegamyhre danielvegamyhre marked this pull request as ready for review February 25, 2026 16:39
stack-info: PR: #3948, branch: danielvegamyhre/stack/145
@danielvegamyhre danielvegamyhre marked this pull request as draft February 28, 2026 00:28
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/145 branch from 7e69751 to 936b309 Compare February 28, 2026 00:28
@danielvegamyhre danielvegamyhre marked this pull request as ready for review February 28, 2026 00:28