
Conversation

@Aishwarya-Tonpe commented Aug 28, 2025

This PR adds support for deterministic training and reproducible logging to all PyTorch model benchmarks in SuperBench (BERT, GPT2, LLaMA, LSTM, CNN, Mixtral).

Deterministic mode: makes model runs consistent across repetitions by fixing random seeds, turning off TF32, and using deterministic math operations (a minimal sketch follows this list).
Log generation: saves key signals such as per-step loss and activation statistics during training.
Log comparison: compares a new run against a previously generated reference log to check that they match.
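For orientation, here is a minimal sketch of what deterministic mode typically sets up, using standard PyTorch APIs. The helper name mirrors the `_enable_deterministic_training()` mentioned in the commit notes below, but the body is illustrative, not the PR's exact implementation:

```python
# Minimal determinism setup sketch; assumes standard PyTorch APIs only.
import random

import numpy as np
import torch


def enable_deterministic_training(seed=42):
    """Fix seeds and force deterministic kernels (illustrative only)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Fail loudly if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Disable TF32 so fp32 matmuls produce identical results across runs.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
```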
New command-line options:

--deterministic: enable deterministic computation.
--generate-log <path>: path of the file where the generated log is saved.
--compare-log <path>: path of the reference JSON log against which the current run's results are compared.
--check-frequency <n>: interval, in steps, at which fingerprints are recorded and checked.

Changes:

Updated pytorch_base.py to handle deterministic settings, logging, and comparisons.
Added a new example script: pytorch_deterministic_example.py
Added a test file: test_pytorch_determinism_all.py to verify everything works as expected.

Usage:

Run with --deterministic --generate-log to create a reference log.
Run again with --deterministic --compare-log to check whether the new run matches the reference.
Make sure CUBLAS_WORKSPACE_CONFIG is set and all parameters stay the same between runs; a sketch of this workflow follows below.
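A minimal sketch of the two-phase workflow through SuperBench's Python launcher; the model name, log path, and flag spellings here are illustrative assumptions:

```python
# Hypothetical two-phase reproducibility check (sketch, not verified code).
import os

# Must be set before any CUDA context exists (see the cuBLAS discussion below).
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

from superbench.benchmarks import BenchmarkRegistry, Framework, Platform

# Phase 1: deterministic run that writes the reference (golden) log.
context = BenchmarkRegistry.create_benchmark_context(
    'bert-base', platform=Platform.CUDA, framework=Framework.PYTORCH,
    parameters='--deterministic --generate-log ref_run.json'
)
BenchmarkRegistry.launch_benchmark(context)

# Phase 2: identical parameters, compared against the reference log.
context = BenchmarkRegistry.create_benchmark_context(
    'bert-base', platform=Platform.CUDA, framework=Framework.PYTORCH,
    parameters='--deterministic --compare-log ref_run.json'
)
BenchmarkRegistry.launch_benchmark(context)
```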

- Add _enable_deterministic_training() method to set all necessary seeds
- Add --deterministic and --random_seed command line arguments
- Integrate deterministic training in _create_model() and _generate_dataset()
- Add comprehensive unit tests for deterministic functionality
- Tests validate parameter parsing, functionality, and regression scenarios
- All tests pass and integrate with existing SuperBench test suite
…pass check_frequency to _is_finished in train/infer; add test capturing checksum log; stabilize fp32 loss path and small-dims determinism tests
…oss BERT/GPT2/CNN/LSTM/Mixtral; per-step fp32 loss logging; checksum logs; tests updated to strict/soft determinism pattern; add strict determinism CI guidance
…rings; fix GPT-2 params; soft vs strict checks stabilized
…sum tests with BERT pattern, improve docstrings and skip logic.
…/CNN/BERT/Mixtral with periodic fingerprints, per-step loss capture, TF32 off, SDPA math kernel; add model_log_utils; update examples and tests, add env gating for cuBLAS.
@Aishwarya-Tonpe Aishwarya-Tonpe requested a review from a team as a code owner August 28, 2025 17:41
@Aishwarya-Tonpe please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
    @microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
    @microsoft-github-policy-service agree company="Microsoft"

@Aishwarya-Tonpe (Author) replied:

@microsoft-github-policy-service agree company="Microsoft"

@guoshzhao (Contributor) commented:

Thanks for addressing all the comments. Since this is a big PR, could we do an apples-to-apples comparison before merging? For example:

  1. Run all e2e model benchmarks on the main branch.
  2. Run all e2e model benchmarks on this branch with deterministic training disabled.
  3. Run all e2e model benchmarks on this branch with deterministic training enabled.

Then compare whether the throughput metrics are as expected.

@Aishwarya-Tonpe (Author) replied:

> Thanks for addressing all the comments. Since this is a big PR, could we do an apples-to-apples comparison before merging? For example:
>
>   1. Run all e2e model benchmarks on the main branch.
>   2. Run all e2e model benchmarks on this branch with deterministic training disabled.
>   3. Run all e2e model benchmarks on this branch with deterministic training enabled.
>
> Then compare whether the throughput metrics are as expected.

Tested and compared all three items listed above; results look good.
I can share the result files if needed, please let me know. Thank you!

@guoshzhao added the benchmarks (SuperBench Benchmarks) and model-benchmarks (Model Benchmark Test for SuperBench Benchmarks) labels Oct 17, 2025
@polarG polarG self-requested a review October 29, 2025 23:26
@guoshzhao guoshzhao requested a review from abuccts November 19, 2025 00:21
SuperBench now supports SDC (silent data corruption) detection to ensure reproducibility across runs. This includes fixed seeds and deterministic algorithms. To enable it, the following flags and environment variables must be set:

- **Flags:**
- `--deterministic`: Enables deterministic computation.
Reviewer comment (Member):
Please use a verb or noun for the argument name, e.g., enable-determinism.


- **Flags:**
- `--deterministic`: Enables deterministic computation.
- `--deterministic_seed <seed>`: Sets the seed for reproducibility.
Reviewer comment (Member):
random-seed?

- **Flags:**
- `--deterministic`: Enables deterministic computation.
- `--deterministic_seed <seed>`: Sets the seed for reproducibility.
- `--generate_log` : Generates the log file that can be used as reference for comparison
Reviewer comment (Member):

  1. What's the current behavior of --generate-log in distributed training? It seems there will be race conditions when multiple ranks write to the same log file.
  2. I think you only record loss/activation every n steps; could you write these into the metrics file, similar to the current performance results, rather than using a new log file?

- `--deterministic`: Enables deterministic computation.
- `--deterministic_seed <seed>`: Sets the seed for reproducibility.
- `--generate_log` : Generates the log file that can be used as reference for comparison
- `--compare_log <path>`: Specifies the path to the reference log for comparison.
Reviewer comment (Member):

Is the comparison necessary? If the loss etc. are among the metrics, they can be compared separately, like throughput is today.

@Aishwarya-Tonpe (Author) replied Dec 9, 2025:

The comparison serves a different purpose than the performance metrics. The feature was designed this way to ensure exact equality and, more importantly, to generate a reference log (golden data) once and then validate all subsequent runs, across machines and repetitions, against it.
I can integrate the fingerprints into the standard metrics file (fixing the race condition) while keeping the exact-match comparison as a separate validation.
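For illustration, a minimal sketch of what such an exact-match validation could look like; the function name is hypothetical, and the JSON key follows the test snippet later in this thread rather than the PR's actual model_log_utils API:

```python
# Hypothetical exact-match comparison against a golden reference log.
# 'per_step_fp32_loss' follows the test assertions below; the rest is assumed.
import json


def compare_logs(reference_path, current):
    """Return (step, reference, current) triples wherever losses differ."""
    with open(reference_path) as f:
        reference = json.load(f)
    mismatches = []
    pairs = zip(reference['per_step_fp32_loss'], current['per_step_fp32_loss'])
    for step, (ref_loss, cur_loss) in enumerate(pairs):
        if ref_loss != cur_loss:  # exact equality, no tolerance window
            mismatches.append((step, ref_loss, cur_loss))
    return mismatches
```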

Comment on lines +46 to +47
- **Environment Variables:**
- `CUBLAS_WORKSPACE_CONFIG=:4096:8`: Ensures deterministic behavior in cuBLAS.
Reviewer comment (Member):

I think you should set this in code when the determinism feature is enabled, rather than asking the user to set it separately.

@Aishwarya-Tonpe (Author) replied:

I took this approach because setting CUBLAS_WORKSPACE_CONFIG programmatically is challenging: it must be set before CUDA initializes, which happens during PyTorch import and benchmark construction. By the time we parse the --deterministic flag, CUDA is already initialized, and setting it before parsing the args would mean it is set all the time, even when it's not necessary.
To set CUBLAS_WORKSPACE_CONFIG=:4096:8 in code, I would need to move torch.backends.cudnn.benchmark = True further down in the flow, so that CUBLAS_WORKSPACE_CONFIG=:4096:8 is applied before any CUDA operations run.
My experiments show that moving it further down into _preprocess() does not affect any other parts of the code, but I would like a second opinion on this to avoid breaking anything.
Please let me know if this sounds okay, thanks.

Current:

class PytorchBase(ModelBenchmark):
    """The base class of Pytorch model benchmarks."""
    def __init__(self, name, parameters=''):
        """Constructor.

        Args:
            name (str): benchmark name.
            parameters (str): benchmark parameters.
        """
        super().__init__(name, parameters)

        self._framework = Framework.PYTORCH

        # State for determinism logging: reference-log generation flag,
        # comparison-log path, and captured metadata/losses/periodic stats.
        self._generate_log = False
        self._compare_log = None
        self._model_run_metadata = {}
        self._model_run_losses = []
        self._model_run_periodic = {}

New approach:

def _preprocess(self):
        """Preprocess and apply PyTorch-specific defaults."""
        preprocess_ok = super()._preprocess()
        if not preprocess_ok:
            return False

        # Set CUBLAS_WORKSPACE_CONFIG and cudnn.benchmark based on deterministic mode
        if getattr(self._args, 'deterministic', False):
            # Set CUBLAS_WORKSPACE_CONFIG for deterministic CUDA operations
            if 'CUBLAS_WORKSPACE_CONFIG' not in os.environ:
                os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
                logger.info('Setting CUBLAS_WORKSPACE_CONFIG=:4096:8 for deterministic training')
        else:
            torch.backends.cudnn.benchmark = True

        return True

assert 'per_step_fp32_loss' in data
assert 'fingerprints' in data
assert isinstance(data['per_step_fp32_loss'], list)
assert isinstance(data['fingerprints'], dict)
Reviewer comment (Member):

Is it possible to compare the detailed values here, given that it's deterministic?

import pytest
from superbench.benchmarks import BenchmarkRegistry, Platform, Framework, ReturnCode

os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
Reviewer comment (Member):

Please set this in code rather than in the tests.

self._handle_deterministic_log_options()
return True

def set_deterministic_seed(self):
Reviewer comment (Member):

It seems most of the new functions here should be defined in model_base rather than in pytorch_base, with only the PyTorch-specific implementation kept in pytorch_base.

@Aishwarya-Tonpe (Author) replied Dec 8, 2025:

Currently, determinism support is only implemented for PyTorch models, so I placed all the determinism-related functions in pytorch_base.py.
However, if it's recommended to place them in model_base, I can move them there. Please let me know.

Copilot AI (Contributor) left a comment:

Pull request overview

This PR adds deterministic training support to all PyTorch model benchmarks (BERT, GPT2, LLaMA, LSTM, CNN, Mixtral) to enable reproducible results. The implementation provides three key capabilities: deterministic mode with fixed seeds and stable operations, log generation to save training fingerprints, and log comparison to verify reproducibility across runs. The changes introduce new command-line flags (--deterministic, --generate-log, --compare-log, --check-frequency) and supporting infrastructure.

  • Adds centralized deterministic training infrastructure in pytorch_base.py with seed control, algorithm determinism, and TF32 disabling
  • Implements model log utilities for saving and comparing training fingerprints (loss values and activation statistics)
  • Updates all six PyTorch model benchmarks consistently with fingerprint recording and tuple return values (see the sketch below)
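To make the tuple-return pattern concrete, here is a rough sketch written as a standalone function; the loop body, timing, and loss capture are illustrative assumptions, not code copied from the PR:

```python
# Hypothetical shape of the updated training loop: per-step durations are
# returned together with per-step fp32 losses for fingerprinting.
import time

import torch


def train_steps_with_loss_capture(model, dataloader, optimizer, loss_fn):
    """Return (per-step durations in ms, per-step fp32 losses)."""
    durations, losses = [], []
    for inputs, targets in dataloader:
        start = time.time()
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make timing and loss capture well-defined
        durations.append((time.time() - start) * 1000)
        losses.append(loss.detach().float().item())  # per-step fp32 loss
    return durations, losses
```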

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 25 comments.

| File | Description |
| --- | --- |
| superbench/common/model_log_utils.py | New utility module for saving, loading, and comparing model run logs with fingerprint data |
| superbench/benchmarks/base.py | Adds argument override from compare_log metadata for deterministic runs |
| superbench/benchmarks/model_benchmarks/model_base.py | Adds set_deterministic_seed() hook for framework-specific deterministic setup |
| superbench/benchmarks/model_benchmarks/pytorch_base.py | Core deterministic training infrastructure with CLI args, fingerprint recording, and post-run log handling |
| superbench/benchmarks/model_benchmarks/pytorch_bert.py | Updates BERT to record fingerprints and return a tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_gpt2.py | Updates GPT2 to record fingerprints and return a tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_llama.py | Updates LLaMA to record fingerprints and return a tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_lstm.py | Updates LSTM to record fingerprints and return a tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_cnn.py | Updates CNN to record fingerprints and return a tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py | Refactors model creation and adds fingerprint recording with tuple return |
| tests/benchmarks/test_base.py | Tests argument override from compare_log metadata |
| tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py | Comprehensive tests for determinism across all PyTorch models |
| examples/benchmarks/pytorch_deterministic_example.py | Example script demonstrating deterministic training usage |
| docs/user-tutorial/benchmarks/model-benchmarks.md | Documentation for the new deterministic training features |
Comments suppressed due to low confidence (4):

  • superbench/benchmarks/model_benchmarks/pytorch_base.py:72: unnecessary 'pass' statement.
  • superbench/benchmarks/model_benchmarks/pytorch_base.py:77: unnecessary 'pass' statement.
  • superbench/benchmarks/model_benchmarks/pytorch_base.py:84: unnecessary 'pass' statement.
  • superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py:216: unnecessary 'pass' statement.


Comment on lines +80 to +81
"""Test argument override from compare_log metadata."""
class DummyBenchmark(Benchmark):
Copilot AI commented Nov 27, 2025:

[nitpick] The test patches model_log_utils.load_model_log but doesn't test error cases (e.g., file not found, malformed JSON, missing metadata). Consider adding negative test cases to ensure robust error handling.
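A sketch of one such negative case, assuming load_model_log raises a standard error on a missing file; the module import path follows the file table above, and the error type is an assumption:

```python
# Hypothetical negative test; load_model_log's error behavior is assumed,
# not taken from the PR's verified API.
import pytest

from superbench.common import model_log_utils


def test_load_model_log_missing_file():
    """Loading a nonexistent reference log should fail loudly."""
    with pytest.raises((FileNotFoundError, OSError)):
        model_log_utils.load_model_log('/nonexistent/ref_run.json')
```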
