feat: INT32 atomic MoE kernel for deterministic accumulation#2005

Open
Mortis-Huang wants to merge 1 commit into ROCm:dev/perf from Mortis-Huang:mortis/int32-atomic-moe

Mortis-Huang commented Feb 9, 2026

Summary

  • Replace global_atomic_pk_add_bf16 with global_atomic_add (INT32) in the MoE kernel so that results are bit-identical across runs
  • Add INT32 post-processing in fused_moe.py (scale factor 32768, interleaved layout decode)
  • Force kernel selection to the modified non-PS, non-tkw1 variant so that unmodified kernel paths are never taken
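Why switching to integer atomics gives bit-identical results can be illustrated with a small host-side sketch. Floating-point addition is not associative, so the arrival order of atomic adds can change the result, whereas integer addition is associative. The sketch below is not the kernel code: it only assumes that each contribution is multiplied by the scale factor 32768 and rounded before being accumulated, and the helper names (`to_fixed`, `accumulate_fixed`, `accumulate_float`) are hypothetical.

```python
import random

SCALE = 32768  # 2**15, the scale factor named in this PR


def to_fixed(x: float) -> int:
    """Quantize a float to a fixed-point integer (round to nearest)."""
    return int(round(x * SCALE))


def accumulate_fixed(values, order):
    """Sum quantized values in the given order, emulating integer atomics."""
    acc = 0
    for i in order:
        acc += to_fixed(values[i])
    # Assumption for this sketch: the sum fits in int32 (no wraparound).
    assert -2**31 <= acc < 2**31
    return acc


def accumulate_float(values, order):
    """Sum the raw floats in the given order, for comparison."""
    acc = 0.0
    for i in order:
        acc += values[i]
    return acc


rng = random.Random(0)
values = [rng.uniform(-4.0, 4.0) for _ in range(1000)]

# Ten random arrival orders stand in for ten runs with different scheduling.
orders = []
for _ in range(10):
    order = list(range(len(values)))
    rng.shuffle(order)
    orders.append(order)

fixed_sums = {accumulate_fixed(values, o) for o in orders}
float_sums = {accumulate_float(values, o) for o in orders}

print("distinct fixed-point sums:", len(fixed_sums))  # always 1
print("distinct float sums:", len(float_sums))        # may be more than 1
```

Under the same assumption, a scale of 2^15 leaves a representable range of roughly ±65536 (2^31 / 2^15) for the accumulated value, with a resolution of 1/32768.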

How to Enable

export AITER_MOE_INT32_ATOMIC=1
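The PR text does not show how fused_moe.py reads this variable. As a rough sketch only, a flag like this is commonly checked as below; the helper name `int32_atomic_enabled` and the treatment of any value other than "1" as disabled are assumptions, not the PR's code.

```python
import os


def int32_atomic_enabled(env=None) -> bool:
    """Hypothetical helper: True only when AITER_MOE_INT32_ATOMIC is "1"."""
    env = os.environ if env is None else env
    return env.get("AITER_MOE_INT32_ATOMIC", "0") == "1"


if __name__ == "__main__":
    # Reports the state of the flag in the current process environment.
    print("INT32 atomic path enabled:", int32_atomic_enabled())
```

If the flag is read once at import time, it would need to be exported before the serving process starts, as in the shell command above.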

Files Changed

| File | Description |
| --- | --- |
| aiter/fused_moe.py | INT32 post-processing + forced kernel selection logic |
| hsa/gfx942/fmoe/silu/fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x192.co | INT32 atomic kernel binary (scale=32768) |

Test Results

Consistency (bit-identical)

  • 10/10 runs identical across all 131,072 output elements
  • Verified with verify_int32_consistency.py
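The script verify_int32_consistency.py is not included in this description. A minimal sketch of what a bit-identical check can look like is given below: it hashes the exact bit pattern of each run's output and requires all digests to match. The function names are hypothetical, and plain Python lists stand in for the kernel's output tensors.

```python
import hashlib
import struct


def digest(values) -> str:
    """Hash the exact IEEE-754 bit pattern of a sequence of floats."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(raw).hexdigest()


def all_bit_identical(runs) -> bool:
    """True when every run produced exactly the same bits."""
    return len({digest(r) for r in runs}) == 1


# Three identical "runs" pass; a sign-bit difference on zero fails,
# even though 0.0 == -0.0 under ordinary numeric comparison.
print(all_bit_identical([[1.0, 2.5, -3.0]] * 3))
print(all_bit_identical([[0.0], [-0.0]]))
```

Comparing bytes rather than values is the stricter choice: it also distinguishes signed zeros and NaN payloads, which a tolerance-based or `==` comparison would not.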

GSM8K Accuracy (parallel=64, 1319 questions)

| Metric | BF16 Baseline | INT32 | Diff |
| --- | --- | --- | --- |
| Accuracy | 94.6% | 94.5% | -0.1 pp |
| Invalid Rate | 0.1% | 0.0% | Improved |
| Output Throughput | 1262 tok/s | 1099 tok/s | -12.9% |

TTFT (input=8000 tokens, concurrency=1)

| Metric | BF16 | INT32 | Diff |
| --- | --- | --- | --- |
| Mean TTFT | 382 ms | 458 ms | +20% |
| Mean TPOT | 9.54 ms | 9.49 ms | -0.5% |
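The Diff values for throughput, TTFT, and TPOT are consistent with a relative change against the BF16 baseline. The short check below recomputes them from the figures reported in the tables; the function name `relative_change_pct` is introduced here for illustration only.

```python
def relative_change_pct(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


throughput = relative_change_pct(1262, 1099)  # output throughput, tok/s
ttft = relative_change_pct(382, 458)          # mean TTFT, ms
tpot = relative_change_pct(9.54, 9.49)        # mean TPOT, ms

print(f"throughput: {throughput:+.1f}%")
print(f"TTFT: {ttft:+.1f}%")
print(f"TPOT: {tpot:+.1f}%")
```

The GSM8K accuracy difference, by contrast, is an absolute difference between two percentages (94.5 minus 94.6), not a relative change.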

Output Stability (request_ali.py, temperature=0)

| Metric | BF16 | INT32 |
| --- | --- | --- |
| Valid outputs | 5/10 (50%) | 10/10 (100%) |

Known Limitations

  • TTFT is ~20% slower because kernel selection is forced: only the non-PS, non-tkw1 variant has been modified
  • Throughput is ~13% lower for the same reason
  • Future improvement: modify all kernel variants (PS, tkw1, tkw1-PS) so that the AITER kernel-selection heuristic can be restored

Test Plan

  • 10-run bit-identical consistency test
  • GSM8K accuracy benchmark
  • TTFT latency benchmark
  • Output stability test (request_ali.py)

Replace BF16 atomic add with INT32 atomic add in MoE kernel to achieve
bit-identical results across runs. Enable via: export AITER_MOE_INT32_ATOMIC=1

- Kernel: INT32 scale factor 32768, interleaved layout
- Python: INT32 post-processing + forced kernel selection
- Accuracy: 94.5% GSM8K (vs 94.6% BF16 baseline)
- Consistency: 10/10 bit-identical (131,072 elements)