
Conversation

@Aishwarya-Tonpe commented Aug 28, 2025

This PR adds support for deterministic training and reproducible logging to all PyTorch model benchmarks in SuperBench (BERT, GPT2, LLaMA, LSTM, CNN, Mixtral).

Deterministic mode: makes model runs consistent across repetitions by fixing random seeds, turning off TF32, and using deterministic math operations (a minimal sketch follows this list).
Log generation: saves key signals such as per-step loss and activation statistics during training.
Log comparison: compares a new run against a previously generated reference log to check that they match.
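For orientation, here is a minimal sketch of what deterministic mode typically sets up, using standard PyTorch APIs. The helper name mirrors the `_enable_deterministic_training()` mentioned in the commit notes below, but the body is illustrative, not the PR's exact implementation:

```python
# Minimal determinism setup sketch; assumes standard PyTorch APIs only.
import random

import numpy as np
import torch


def enable_deterministic_training(seed=42):
    """Fix seeds and force deterministic kernels (illustrative only)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Fail loudly if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Disable TF32 so fp32 matmuls produce identical results across runs.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
```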
New command-line options:

--deterministic: enable deterministic computation.
--generate-log <path>: path of the file where the generated log is saved.
--compare-log <path>: path of the reference JSON log against which the current run's results are compared.
--check-frequency <n>: interval, in steps, at which fingerprints are recorded and checked.

Changes:

Updated pytorch_base.py to handle deterministic settings, logging, and comparisons.
Added a new example script: pytorch_deterministic_example.py
Added a test file: test_pytorch_determinism_all.py to verify everything works as expected.

Usage:

Run with --deterministic --generate-log to create a reference log.
Run again with --deterministic --compare-log to check whether the new run matches the reference.
Make sure CUBLAS_WORKSPACE_CONFIG is set and all parameters stay the same between runs; a sketch of this workflow follows below.
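A minimal sketch of the two-phase workflow through SuperBench's Python launcher; the model name, log path, and flag spellings here are illustrative assumptions:

```python
# Hypothetical two-phase reproducibility check (sketch, not verified code).
import os

# Must be set before any CUDA context exists (see the cuBLAS discussion below).
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

from superbench.benchmarks import BenchmarkRegistry, Framework, Platform

# Phase 1: deterministic run that writes the reference (golden) log.
context = BenchmarkRegistry.create_benchmark_context(
    'bert-base', platform=Platform.CUDA, framework=Framework.PYTORCH,
    parameters='--deterministic --generate-log ref_run.json'
)
BenchmarkRegistry.launch_benchmark(context)

# Phase 2: identical parameters, compared against the reference log.
context = BenchmarkRegistry.create_benchmark_context(
    'bert-base', platform=Platform.CUDA, framework=Framework.PYTORCH,
    parameters='--deterministic --compare-log ref_run.json'
)
BenchmarkRegistry.launch_benchmark(context)
```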

- Add _enable_deterministic_training() method to set all necessary seeds
- Add --deterministic and --random_seed command line arguments
- Integrate deterministic training in _create_model() and _generate_dataset()
- Add comprehensive unit tests for deterministic functionality
- Tests validate parameter parsing, functionality, and regression scenarios
- All tests pass and integrate with existing SuperBench test suite
…pass check_frequency to _is_finished in train/infer; add test capturing checksum log; stabilize fp32 loss path and small-dims determinism tests
…oss BERT/GPT2/CNN/LSTM/Mixtral; per-step fp32 loss logging; checksum logs; tests updated to strict/soft determinism pattern; add strict determinism CI guidance
…rings; fix GPT-2 params; soft vs strict checks stabilized
…sum tests with BERT pattern, improve docstrings and skip logic.
…/CNN/BERT/Mixtral with periodic fingerprints, per-step loss capture, TF32 off, SDPA math kernel; add model_log_utils; update examples and tests, add env gating for cuBLAS.
@Aishwarya-Tonpe Aishwarya-Tonpe requested a review from a team as a code owner August 28, 2025 17:41
@Aishwarya-Tonpe please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
    @microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
    @microsoft-github-policy-service agree company="Microsoft"

@Aishwarya-Tonpe (Author) replied:

@microsoft-github-policy-service agree company="Microsoft"

@guoshzhao (Contributor) commented:

Thanks for addressing all the comments. Since this is a big PR, could we do an apples-to-apples comparison before merging? For example:

  1. Run all e2e model benchmarks on the main branch.
  2. Run all e2e model benchmarks on this branch with deterministic training disabled.
  3. Run all e2e model benchmarks on this branch with deterministic training enabled.

Then compare whether the throughput metrics are as expected.

@Aishwarya-Tonpe (Author) replied:

> Thanks for addressing all the comments. Since this is a big PR, could we do an apples-to-apples comparison before merging? For example:
>
>   1. Run all e2e model benchmarks on the main branch.
>   2. Run all e2e model benchmarks on this branch with deterministic training disabled.
>   3. Run all e2e model benchmarks on this branch with deterministic training enabled.
>
> Then compare whether the throughput metrics are as expected.

Tested and compared all three items listed above; results look good.
I can share the result files if needed, please let me know. Thank you!

@guoshzhao added the benchmarks (SuperBench Benchmarks) and model-benchmarks (Model Benchmark Test for SuperBench Benchmarks) labels Oct 17, 2025
@polarG polarG self-requested a review October 29, 2025 23:26
@guoshzhao guoshzhao requested a review from abuccts November 19, 2025 00:21
SuperBench now supports SDC (silent data corruption) detection to ensure reproducibility across runs. This includes fixed seeds and deterministic algorithms. To enable it, the following flags and environment variables must be set:

- **Flags:**
- `--deterministic`: Enables deterministic computation.
Reviewer comment (Member):
Please use a verb or noun for the argument name, e.g., enable-determinism.


- **Flags:**
- `--deterministic`: Enables deterministic computation.
- `--deterministic_seed <seed>`: Sets the seed for reproducibility.
Reviewer comment (Member):
random-seed?

- **Flags:**
- `--deterministic`: Enables deterministic computation.
- `--deterministic_seed <seed>`: Sets the seed for reproducibility.
- `--generate_log` : Generates the log file that can be used as reference for comparison
Reviewer comment (Member):

  1. What's the current behavior of --generate-log in distributed training? It seems there will be race conditions when multiple ranks write to the same log file.
  2. I think you only record loss/activation every n steps; could you write these into the metrics file, similar to the current performance results, rather than using a new log file?

- `--deterministic`: Enables deterministic computation.
- `--deterministic_seed <seed>`: Sets the seed for reproducibility.
- `--generate_log` : Generates the log file that can be used as reference for comparison
- `--compare_log <path>`: Specifies the path to the reference log for comparison.
Reviewer comment (Member):

Is the comparison necessary? If the loss etc. are among the metrics, they can be compared separately, like throughput is today.

@Aishwarya-Tonpe (Author) replied Dec 9, 2025:

The comparison serves a different purpose than the performance metrics. The feature was designed this way to ensure exact equality and, more importantly, to generate a reference log (golden data) once and then validate all subsequent runs, across machines and repetitions, against it.
I can integrate the fingerprints into the standard metrics file (fixing the race condition) while keeping the exact-match comparison as a separate validation.
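For illustration, a minimal sketch of what such an exact-match validation could look like; the function name is hypothetical, and the JSON key follows the test snippet later in this thread rather than the PR's actual model_log_utils API:

```python
# Hypothetical exact-match comparison against a golden reference log.
# 'per_step_fp32_loss' follows the test assertions below; the rest is assumed.
import json


def compare_logs(reference_path, current):
    """Return (step, reference, current) triples wherever losses differ."""
    with open(reference_path) as f:
        reference = json.load(f)
    mismatches = []
    pairs = zip(reference['per_step_fp32_loss'], current['per_step_fp32_loss'])
    for step, (ref_loss, cur_loss) in enumerate(pairs):
        if ref_loss != cur_loss:  # exact equality, no tolerance window
            mismatches.append((step, ref_loss, cur_loss))
    return mismatches
```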

Comment on lines +46 to +47
- **Environment Variables:**
- `CUBLAS_WORKSPACE_CONFIG=:4096:8`: Ensures deterministic behavior in cuBLAS.
Reviewer comment (Member):

I think you should set this in code when the determinism feature is enabled, rather than asking the user to set it separately.

@Aishwarya-Tonpe (Author) replied:

I took this approach because setting CUBLAS_WORKSPACE_CONFIG programmatically is challenging: it must be set before CUDA initializes, which happens during PyTorch import and benchmark construction. By the time we parse the --deterministic flag, CUDA is already initialized, and setting it before parsing the args would mean it is set all the time, even when it's not necessary.
To set CUBLAS_WORKSPACE_CONFIG=:4096:8 in code, I would need to move torch.backends.cudnn.benchmark = True further down in the flow, so that CUBLAS_WORKSPACE_CONFIG=:4096:8 is applied before any CUDA operations run.
My experiments show that moving it further down into _preprocess() does not affect any other parts of the code, but I would like a second opinion on this to avoid breaking anything.
Please let me know if this sounds okay, thanks.

Current:

class PytorchBase(ModelBenchmark):
    """The base class of Pytorch model benchmarks."""
    def __init__(self, name, parameters=''):
        """Constructor.

        Args:
            name (str): benchmark name.
            parameters (str): benchmark parameters.
        """
        super().__init__(name, parameters)

        self._framework = Framework.PYTORCH

        # State for determinism logging: reference-log generation flag,
        # comparison-log path, and captured metadata/losses/periodic stats.
        self._generate_log = False
        self._compare_log = None
        self._model_run_metadata = {}
        self._model_run_losses = []
        self._model_run_periodic = {}

New approach:

def _preprocess(self):
        """Preprocess and apply PyTorch-specific defaults."""
        preprocess_ok = super()._preprocess()
        if not preprocess_ok:
            return False

        # Set CUBLAS_WORKSPACE_CONFIG and cudnn.benchmark based on deterministic mode
        if getattr(self._args, 'deterministic', False):
            # Set CUBLAS_WORKSPACE_CONFIG for deterministic CUDA operations
            if 'CUBLAS_WORKSPACE_CONFIG' not in os.environ:
                os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
                logger.info('Setting CUBLAS_WORKSPACE_CONFIG=:4096:8 for deterministic training')
        else:
            torch.backends.cudnn.benchmark = True

        return True

assert 'per_step_fp32_loss' in data
assert 'fingerprints' in data
assert isinstance(data['per_step_fp32_loss'], list)
assert isinstance(data['fingerprints'], dict)
Reviewer comment (Member):

Is it possible to compare the detailed values here, given that it's deterministic?

import pytest
from superbench.benchmarks import BenchmarkRegistry, Platform, Framework, ReturnCode

os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
Reviewer comment (Member):

Please set this in code rather than in the tests.

self._handle_deterministic_log_options()
return True

def set_deterministic_seed(self):
Reviewer comment (Member):

It seems most of the new functions here should be defined in model_base rather than in pytorch_base, with only the PyTorch-specific implementation kept in pytorch_base.

@Aishwarya-Tonpe (Author) replied Dec 8, 2025:

Currently, determinism support is only implemented for PyTorch models, so I placed all the determinism-related functions in pytorch_base.py.
However, if it's recommended to place them in model_base, I can move them there. Please let me know.

Copilot AI (Contributor) left a comment:

Pull request overview

This PR adds deterministic training support to all PyTorch model benchmarks (BERT, GPT2, LLaMA, LSTM, CNN, Mixtral) to enable reproducible results. The implementation provides three key capabilities: deterministic mode with fixed seeds and stable operations, log generation to save training fingerprints, and log comparison to verify reproducibility across runs. The changes introduce new command-line flags (--deterministic, --generate-log, --compare-log, --check-frequency) and supporting infrastructure.

  • Adds centralized deterministic training infrastructure in pytorch_base.py with seed control, algorithm determinism, and TF32 disabling
  • Implements model log utilities for saving and comparing training fingerprints (loss values and activation statistics)
  • Updates all six PyTorch model benchmarks consistently with fingerprint recording and tuple return values (see the sketch below)
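To make the tuple-return pattern concrete, here is a rough sketch written as a standalone function; the loop body, timing, and loss capture are illustrative assumptions, not code copied from the PR:

```python
# Hypothetical shape of the updated training loop: per-step durations are
# returned together with per-step fp32 losses for fingerprinting.
import time

import torch


def train_steps_with_loss_capture(model, dataloader, optimizer, loss_fn):
    """Return (per-step durations in ms, per-step fp32 losses)."""
    durations, losses = [], []
    for inputs, targets in dataloader:
        start = time.time()
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make timing and loss capture well-defined
        durations.append((time.time() - start) * 1000)
        losses.append(loss.detach().float().item())  # per-step fp32 loss
    return durations, losses
```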

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 25 comments.

| File | Description |
| --- | --- |
| superbench/common/model_log_utils.py | New utility module for saving, loading, and comparing model run logs with fingerprint data |
| superbench/benchmarks/base.py | Adds argument override from compare_log metadata for deterministic runs |
| superbench/benchmarks/model_benchmarks/model_base.py | Adds set_deterministic_seed() hook for framework-specific deterministic setup |
| superbench/benchmarks/model_benchmarks/pytorch_base.py | Core deterministic training infrastructure with CLI args, fingerprint recording, and post-run log handling |
| superbench/benchmarks/model_benchmarks/pytorch_bert.py | Updates BERT to record fingerprints and return a tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_gpt2.py | Updates GPT2 to record fingerprints and return a tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_llama.py | Updates LLaMA to record fingerprints and return a tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_lstm.py | Updates LSTM to record fingerprints and return a tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_cnn.py | Updates CNN to record fingerprints and return a tuple from _train_step |
| superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py | Refactors model creation and adds fingerprint recording with tuple return |
| tests/benchmarks/test_base.py | Tests argument override from compare_log metadata |
| tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py | Comprehensive tests for determinism across all PyTorch models |
| examples/benchmarks/pytorch_deterministic_example.py | Example script demonstrating deterministic training usage |
| docs/user-tutorial/benchmarks/model-benchmarks.md | Documentation for the new deterministic training features |
Comments suppressed due to low confidence (4):

  • superbench/benchmarks/model_benchmarks/pytorch_base.py:72: unnecessary 'pass' statement.
  • superbench/benchmarks/model_benchmarks/pytorch_base.py:77: unnecessary 'pass' statement.
  • superbench/benchmarks/model_benchmarks/pytorch_base.py:84: unnecessary 'pass' statement.
  • superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py:216: unnecessary 'pass' statement.


Comment on lines +80 to +81
"""Test argument override from compare_log metadata."""
class DummyBenchmark(Benchmark):
Copilot AI commented Nov 27, 2025:

[nitpick] The test patches model_log_utils.load_model_log but doesn't test error cases (e.g., file not found, malformed JSON, missing metadata). Consider adding negative test cases to ensure robust error handling.
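A sketch of one such negative case, assuming load_model_log raises a standard error on a missing file; the module import path follows the file table above, and the error type is an assumption:

```python
# Hypothetical negative test; load_model_log's error behavior is assumed,
# not taken from the PR's verified API.
import pytest

from superbench.common import model_log_utils


def test_load_model_log_missing_file():
    """Loading a nonexistent reference log should fail loudly."""
    with pytest.raises((FileNotFoundError, OSError)):
        model_log_utils.load_model_log('/nonexistent/ref_run.json')
```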
