Benchmark: Model benchmark - deterministic training support #731
Conversation
- Add `_enable_deterministic_training()` method to set all necessary seeds
- Add `--deterministic` and `--random_seed` command line arguments
- Integrate deterministic training in `_create_model()` and `_generate_dataset()`
- Add comprehensive unit tests for deterministic functionality
- Tests validate parameter parsing, functionality, and regression scenarios
- All tests pass and integrate with existing SuperBench test suite
…pass check_frequency to _is_finished in train/infer; add test capturing checksum log; stabilize fp32 loss path and small-dims determinism tests
…oss BERT/GPT2/CNN/LSTM/Mixtral; per-step fp32 loss logging; checksum logs; tests updated to strict/soft determinism pattern; add strict determinism CI guidance
…rings; fix GPT-2 params; soft vs strict checks stabilized
…sum tests with BERT pattern, improve docstrings and skip logic.
…BERT, GPT-2, LSTM, CNN, LLaMA examples
… models; update tests
…/CNN/BERT/Mixtral with periodic fingerprints, per-step loss capture, TF32 off, SDPA math kernel; add model_log_utils; update examples and tests, add env gating for cuBLAS.
…ted example file, remove redundant code
… unnecessary code
…idual model classes
… reduce redundant code
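For context on the `_enable_deterministic_training()` helper named in the first commit above, a sketch of the usual PyTorch seed-and-determinism pattern follows; this is the conventional recipe, not the PR's actual implementation:

```python
import random

import numpy as np
import torch


def _enable_deterministic_training(seed=42):
    """Sketch: seed every RNG in play and force deterministic kernels."""
    random.seed(seed)                          # Python built-in RNG
    np.random.seed(seed)                       # NumPy RNG (synthetic dataset generation)
    torch.manual_seed(seed)                    # CPU RNG
    torch.cuda.manual_seed_all(seed)           # all CUDA device RNGs
    torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels only
    torch.backends.cudnn.benchmark = False     # autotuner selects kernels non-deterministically
    torch.use_deterministic_algorithms(True)   # raise on known non-deterministic ops
```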
@microsoft-github-policy-service agree company="Microsoft"
Thanks for addressing all the comments. Since this is a big PR, could we do an apples-to-apples comparison before merging? For example,

Tested and compared all 3 items listed above. Looks good.
> SuperBench now supports SDC to ensure reproducibility across runs. This includes fixed seeds and deterministic algorithms. To enable SDC, the following flags and environment variables must be set:
> - **Flags:**
> - `--deterministic`: Enables deterministic computation.
Please use a verb or noun for the argument, e.g., `enable-determinism`.
> - **Flags:**
> - `--deterministic`: Enables deterministic computation.
> - `--deterministic_seed <seed>`: Sets the seed for reproducibility.
`random-seed`?
> - **Flags:**
> - `--deterministic`: Enables deterministic computation.
> - `--deterministic_seed <seed>`: Sets the seed for reproducibility.
> - `--generate_log`: Generates the log file that can be used as reference for comparison
- What's the current behavior of `--generate-log` in distributed training? It seems there will be race conditions with multiple ranks writing the same log file.
- I think you only record loss/activation every n steps; could you emit these into the metrics file, similar to the current performance results, rather than using a new log file?
> - `--deterministic`: Enables deterministic computation.
> - `--deterministic_seed <seed>`: Sets the seed for reproducibility.
> - `--generate_log`: Generates the log file that can be used as reference for comparison
> - `--compare_log <path>`: Specifies the path to the reference log for comparison.
Is the comparison necessary? If the loss etc. is one of the metrics, it can be compared separately, like the current throughput.
The comparison serves a different purpose than the performance metrics. The feature was designed this way to ensure exact equality and, more importantly, to generate a reference log (golden data) once and validate all subsequent runs, across different machines and runs, against it.

I can integrate fingerprints into the standard metrics file (fixing the race condition), while keeping the exact-match comparison as a separate validation.
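For what it's worth, a minimal sketch of the rank-0 guard that would remove the write race, assuming `torch.distributed` is used for multi-rank runs (the function name here is illustrative, not the PR's code):

```python
import json

import torch.distributed as dist


def save_model_log(path, data):
    """Sketch: write the reference log from rank 0 only."""
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return  # non-zero ranks skip the write, avoiding concurrent writers
    with open(path, 'w') as f:
        json.dump(data, f, indent=2)
```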
> - **Environment Variables:**
> - `CUBLAS_WORKSPACE_CONFIG=:4096:8`: Ensures deterministic behavior in cuBLAS.
I think you should set this in code when the determinism feature is enabled, rather than asking the user to set it separately.
I took this approach because setting CUBLAS_WORKSPACE_CONFIG programmatically is challenging: it must be set before CUDA initializes, which happens during PyTorch import and benchmark construction. By the time we parse the --deterministic flag, CUDA is already initialized. Setting it before parsing the args would mean it is set ALL THE TIME, even when it's not necessary.

To set CUBLAS_WORKSPACE_CONFIG=:4096:8 in code, I would need to move torch.backends.cudnn.benchmark = True further down in the flow, so that CUBLAS_WORKSPACE_CONFIG=:4096:8 is in place before any CUDA operations run.

My experiments show that moving it down into _preprocess() does not affect any other parts of the code; however, I would like a second opinion on this to avoid breaking anything.

Please let me know if this sounds okay, thanks.
Current:

```python
class PytorchBase(ModelBenchmark):
    """The base class of Pytorch model benchmarks."""
    def __init__(self, name, parameters=''):
        """Constructor.

        Args:
            name (str): benchmark name.
            parameters (str): benchmark parameters.
        """
        super().__init__(name, parameters)
        self._framework = Framework.PYTORCH
        self._generate_log = False
        self._compare_log = None
        self._model_run_metadata = {}
        self._model_run_losses = []
        self._model_run_periodic = {}
```
New approach:

```python
def _preprocess(self):
    """Preprocess and apply PyTorch-specific defaults."""
    preprocess_ok = super()._preprocess()
    if not preprocess_ok:
        return False
    # Set CUBLAS_WORKSPACE_CONFIG and cudnn.benchmark based on deterministic mode.
    if getattr(self._args, 'deterministic', False):
        # Set CUBLAS_WORKSPACE_CONFIG for deterministic CUDA operations.
        if 'CUBLAS_WORKSPACE_CONFIG' not in os.environ:
            os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
            logger.info('Setting CUBLAS_WORKSPACE_CONFIG=:4096:8 for deterministic training')
    else:
        torch.backends.cudnn.benchmark = True
    return True
```
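For reference, the PyTorch reproducibility notes back up the ordering constraint: with `torch.use_deterministic_algorithms(True)` on CUDA 10.2+, `CUBLAS_WORKSPACE_CONFIG` must be `:4096:8` (or `:16:8`) before cuBLAS is first used, so setting it in `_preprocess()` works only as long as no cuBLAS kernel has run earlier in the flow.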
> assert 'per_step_fp32_loss' in data
> assert 'fingerprints' in data
> assert isinstance(data['per_step_fp32_loss'], list)
> assert isinstance(data['fingerprints'], dict)
Is it possible to compare the detailed values here, if it's deterministic?
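A sketch of what exact-value assertions could look like across two seeded runs (`run_benchmark` here is a hypothetical helper, not the PR's test code):

```python
def test_losses_identical_across_runs():
    """Sketch: two runs with the same seed should agree exactly, not approximately."""
    first = run_benchmark('--deterministic --random_seed 42')   # hypothetical helper
    second = run_benchmark('--deterministic --random_seed 42')
    # Exact equality is the whole point of determinism, so no tolerance is applied.
    assert first['per_step_fp32_loss'] == second['per_step_fp32_loss']
    assert first['fingerprints'] == second['fingerprints']
```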
> import pytest
> from superbench.benchmarks import BenchmarkRegistry, Platform, Framework, ReturnCode
>
> os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
Please set this in code rather than in tests.
> self._handle_deterministic_log_options()
> return True
>
> def set_deterministic_seed(self):
It seems most of the new functions here should be defined in model_base rather than pytorch_base; pytorch_base should just hold the PyTorch-related implementation.
Currently, determinism support is only implemented for PyTorch models, so I placed all the determinism-related functions in pytorch_base.py. However, if it's recommended to place them in model_base, I can move the functions to that file. Please let me know.
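For illustration, the split the reviewer is suggesting usually takes the form of a no-op hook in the framework-agnostic base class with the override in the PyTorch subclass; a sketch under that assumption, not the PR's actual code:

```python
class ModelBenchmark:
    """Framework-agnostic base (sketch)."""
    def set_deterministic_seed(self):
        """Hook for framework-specific deterministic setup; default is a no-op."""
        pass


class PytorchBase(ModelBenchmark):
    """PyTorch-specific implementation (sketch)."""
    def set_deterministic_seed(self):
        import torch
        torch.manual_seed(self._random_seed)           # _random_seed: illustrative attribute
        torch.cuda.manual_seed_all(self._random_seed)
        torch.use_deterministic_algorithms(True)
```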
Pull request overview
This PR adds deterministic training support to all PyTorch model benchmarks (BERT, GPT2, LLaMA, LSTM, CNN, Mixtral) to enable reproducible results. The implementation provides three key capabilities: deterministic mode with fixed seeds and stable operations, log generation to save training fingerprints, and log comparison to verify reproducibility across runs. The changes introduce new command-line flags (--deterministic, --generate-log, --compare-log, --check-frequency) and supporting infrastructure.
- Adds centralized deterministic training infrastructure in pytorch_base.py with seed control, algorithm determinism, and TF32 disabling
- Implements model log utilities for saving and comparing training fingerprints (loss values and activation statistics)
- Updates all six PyTorch model benchmarks consistently with fingerprint recording and tuple return values
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 25 comments.
| File | Description |
|---|---|
| superbench/common/model_log_utils.py | New utility module for saving, loading, and comparing model run logs with fingerprint data |
| superbench/benchmarks/base.py | Adds argument override from compare_log metadata for deterministic runs |
| superbench/benchmarks/model_benchmarks/model_base.py | Adds set_deterministic_seed() hook for framework-specific deterministic setup |
| superbench/benchmarks/model_benchmarks/pytorch_base.py | Core deterministic training infrastructure with CLI args, fingerprint recording, and post-run log handling |
| superbench/benchmarks/model_benchmarks/pytorch_bert.py | Updates BERT to record fingerprints and return tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_gpt2.py | Updates GPT2 to record fingerprints and return tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_llama.py | Updates LLaMA to record fingerprints and return tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_lstm.py | Updates LSTM to record fingerprints and return tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_cnn.py | Updates CNN to record fingerprints and return tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py | Refactors model creation and adds fingerprint recording with tuple return |
| tests/benchmarks/test_base.py | Tests argument override from compare_log metadata |
| tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py | Comprehensive tests for determinism across all PyTorch models |
| examples/benchmarks/pytorch_deterministic_example.py | Example script demonstrating deterministic training usage |
| docs/user-tutorial/benchmarks/model-benchmarks.md | Documentation for new deterministic training features |
Comments suppressed due to low confidence (4)
- `superbench/benchmarks/model_benchmarks/pytorch_base.py:72`: unnecessary `pass` statement.
- `superbench/benchmarks/model_benchmarks/pytorch_base.py:77`: unnecessary `pass` statement.
- `superbench/benchmarks/model_benchmarks/pytorch_base.py:84`: unnecessary `pass` statement.
- `superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py:216`: unnecessary `pass` statement.
| """Test argument override from compare_log metadata.""" | ||
| class DummyBenchmark(Benchmark): |
Copilot (AI), Nov 27, 2025:
[nitpick] The test patches model_log_utils.load_model_log but doesn't test error cases (e.g., file not found, malformed JSON, missing metadata). Consider adding negative test cases to ensure robust error handling.
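The negative cases might look something like this (assuming `load_model_log` in `superbench/common/model_log_utils.py` surfaces the underlying exceptions; that behavior is an assumption, as is the exact helper signature):

```python
import json

import pytest

from superbench.common.model_log_utils import load_model_log  # module path from this PR


def test_load_model_log_missing_file(tmp_path):
    # A missing reference log should fail loudly rather than silently pass.
    with pytest.raises(FileNotFoundError):
        load_model_log(str(tmp_path / 'does_not_exist.json'))


def test_load_model_log_malformed_json(tmp_path):
    bad = tmp_path / 'bad.json'
    bad.write_text('{not valid json')
    # Assumes load_model_log propagates the JSON parse error.
    with pytest.raises(json.JSONDecodeError):
        load_model_log(str(bad))
```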
…to aishwaryatonpe/deterministic-training
…nd inference steps
Co-authored-by: Copilot <[email protected]>
Adds support for deterministic training and reproducible logging to all PyTorch model benchmarks in SuperBench (BERT, GPT2, LLaMA, LSTM, CNN, Mixtral).
- Deterministic mode: makes model runs consistent every time by fixing random seeds, turning off TF32, and using stable math operations.
- Log generation: saves key info like loss and activation stats during training.
- Log comparison: lets you compare a new run with a previous one to check whether they match.
New command-line options:
- `--deterministic`
- `--generate-log <path>`: path of the file where the generated log is saved
- `--compare-log <path>`: path of the JSON log file against which the current run's results are compared
- `--check-frequency`
Changes:
- Updated pytorch_base.py to handle deterministic settings, logging, and comparisons.
- Added a new example script: pytorch_deterministic_example.py
- Added a test file, test_pytorch_determinism_all.py, to verify everything works as expected.
Usage:
- Run with `--deterministic --generate-log` to create a reference log.
- Run again with `--compare-log` to check whether the new run matches the reference.
- Make sure `CUBLAS_WORKSPACE_CONFIG` is set and all parameters stay the same between runs.
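Put together, a reference-then-validate pair might look like this in a Python driver (a sketch using SuperBench's benchmark registry API; the benchmark name and parameter strings are assumptions based on the flags above):

```python
import os

from superbench.benchmarks import BenchmarkRegistry, Framework, Platform

# Must be set before the first cuBLAS call (see the CUBLAS_WORKSPACE_CONFIG discussion).
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

# Run 1: produce the reference ("golden") log.
context = BenchmarkRegistry.create_benchmark_context(
    'bert-base',                                   # assumed benchmark name
    platform=Platform.CUDA,
    framework=Framework.PYTORCH,
    parameters='--deterministic --generate-log /tmp/bert_ref.json',
)
BenchmarkRegistry.launch_benchmark(context)

# Run 2: validate the new run against the reference.
context = BenchmarkRegistry.create_benchmark_context(
    'bert-base',
    platform=Platform.CUDA,
    framework=Framework.PYTORCH,
    parameters='--deterministic --compare-log /tmp/bert_ref.json',
)
BenchmarkRegistry.launch_benchmark(context)
```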