feat: INT32 atomic MoE kernel for deterministic accumulation by Mortis-Huang · Pull Request #2005 · ROCm/aiter

Mortis-Huang · 2026-02-09T06:00:05Z

Summary

- Replace `global_atomic_pk_add_bf16` with `global_atomic_add` (INT32) in the MoE kernel so that results are bit-identical across runs
- Add INT32 post-processing in `fused_moe.py` (scale factor 32768, interleaved layout decode)
- Force kernel selection to the modified non-PS, non-tkw1 variant so that unmodified kernel paths are never taken
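
The reasoning behind the swap can be illustrated on the host: floating-point addition is not associative, so the arrival order of atomic adds changes the rounded sum, whereas integer addition gives the same total in any order. The sketch below is a NumPy illustration only, not the kernel code. It uses `float16` as a stand-in for BF16 (NumPy has no native bfloat16), and the helper names and the round-to-nearest quantization are assumptions; only the scale factor 32768 comes from this PR.

```python
import itertools

import numpy as np

SCALE = 32768  # 2**15, the scale factor quoted in the summary


def accumulate_float16(values, order):
    """Sum values in the given arrival order, rounding to float16 after each add."""
    acc = np.float16(0.0)
    for i in order:
        acc = np.float16(acc + np.float16(values[i]))
    return float(acc)


def accumulate_int32(values, order):
    """Quantize each value to fixed point (assumed round-to-nearest), sum as int32, decode."""
    acc = np.int32(0)
    for i in order:
        acc = acc + np.int32(round(values[i] * SCALE))
    return float(acc) / SCALE


values = [2048.0, 1.0, -2048.0]
orders = list(itertools.permutations(range(len(values))))

# Distinct results observed over all six arrival orders.
float_results = {accumulate_float16(values, o) for o in orders}
int_results = {accumulate_int32(values, o) for o in orders}

print("float16 sums:", sorted(float_results))
print("int32 fixed-point sums:", sorted(int_results))
```

With a scale of 2**15, a signed 32-bit accumulator covers magnitudes below 2**16 = 65536 at a resolution of 1/32768, so this approach presumably relies on the accumulated values staying inside that range.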

How to Enable

export AITER_MOE_INT32_ATOMIC=1
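
When the process is launched from Python rather than a shell, the same flag can be set through `os.environ`. This is a minimal sketch: the helper `int32_atomic_enabled` is hypothetical, and the comparison against the string `"1"` is an assumption, since the PR description does not show how `fused_moe.py` parses the variable or when it reads it.

```python
import os

FLAG = "AITER_MOE_INT32_ATOMIC"  # variable name taken from the PR description


def int32_atomic_enabled(environ=os.environ):
    """Hypothetical helper: True when the flag is set to the string "1"."""
    return environ.get(FLAG, "0") == "1"


# Set the flag before importing aiter, in case the library reads it at import time
# (an assumption; it may instead be read on each call).
os.environ[FLAG] = "1"

print(int32_atomic_enabled())
```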

Files Changed

| File | Description |
| --- | --- |
| `aiter/fused_moe.py` | INT32 post-processing + forced kernel selection logic |
| `hsa/gfx942/fmoe/silu/fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x192.co` | INT32 atomic kernel binary (scale=32768) |

Test Results

Consistency (bit-identical)

- 10/10 runs identical across all 131,072 output elements
- Verified with `verify_int32_consistency.py`
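
The script `verify_int32_consistency.py` is not reproduced here. As a rough sketch of what a bit-identical check can look like, the hypothetical helper below compares raw bytes rather than values, because value comparison treats `0.0` and `-0.0` as equal and treats NaN as unequal to itself. The function name and the synthetic inputs are assumptions for illustration only.

```python
import numpy as np


def all_runs_bit_identical(outputs):
    """True when every array matches the first in dtype, shape, and raw bytes."""
    ref = outputs[0]
    ref_bytes = ref.tobytes()
    return all(
        out.dtype == ref.dtype
        and out.shape == ref.shape
        and out.tobytes() == ref_bytes
        for out in outputs[1:]
    )


# Synthetic stand-in for 10 runs of 131,072 output elements each.
runs = [np.arange(131072, dtype=np.float32) for _ in range(10)]
print(all_runs_bit_identical(runs))

# A sign-of-zero difference is invisible to value equality but not to a byte comparison.
pos_zero = np.array([0.0], dtype=np.float32)
neg_zero = np.array([-0.0], dtype=np.float32)
print(np.array_equal(pos_zero, neg_zero), all_runs_bit_identical([pos_zero, neg_zero]))
```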

GSM8K Accuracy (parallel=64, 1319 questions)

| Metric | BF16 Baseline | INT32 | Diff |
| --- | --- | --- | --- |
| Accuracy | 94.6% | 94.5% | -0.1 pp |
| Invalid Rate | 0.1% | 0.0% | Improved |
| Output Throughput | 1262 tok/s | 1099 tok/s | -12.9% |
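
To put the accuracy gap in terms of individual questions, the sketch below enumerates which correct-answer counts out of 1319 are consistent with each reported percentage. It assumes accuracy is correct answers divided by total questions, rounded to one decimal place; the helper name is hypothetical.

```python
TOTAL_QUESTIONS = 1319  # from the benchmark heading


def counts_matching(pct, total):
    """All correct-answer counts whose accuracy rounds to pct at one decimal place."""
    return [k for k in range(total + 1) if round(100 * k / total, 1) == pct]


baseline_counts = counts_matching(94.6, TOTAL_QUESTIONS)
int32_counts = counts_matching(94.5, TOTAL_QUESTIONS)

print("BF16 baseline:", baseline_counts)
print("INT32:", int32_counts)
```

Under that assumption, the baseline corresponds to 1248 correct answers and the INT32 run to 1246 or 1247, so the reported gap amounts to one or two questions.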

TTFT (input=8000 tokens, concurrency=1)

| Metric | BF16 | INT32 | Diff |
| --- | --- | --- | --- |
| Mean TTFT | 382 ms | 458 ms | +20% |
| Mean TPOT | 9.54 ms | 9.49 ms | -0.5% |
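
The Diff columns in the tables above can be reproduced as a relative change against the BF16 baseline. The small helper below is a hypothetical illustration; it shows that the TTFT regression is 19.9% before rounding to the quoted 20%.

```python
def relative_change_pct(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline


ttft_diff = round(relative_change_pct(382, 458), 1)          # mean TTFT, ms
tpot_diff = round(relative_change_pct(9.54, 9.49), 1)        # mean TPOT, ms
throughput_diff = round(relative_change_pct(1262, 1099), 1)  # output tok/s

print(ttft_diff, tpot_diff, throughput_diff)
```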

Output Stability (request_ali.py, temperature=0)

| Metric | BF16 | INT32 |
| --- | --- | --- |
| Valid outputs | 5/10 (50%) | 10/10 (100%) |

Known Limitations

- TTFT is ~20% slower because kernel selection is forced (only the non-PS, non-tkw1 variant has been modified)
- Throughput is ~13% lower for the same reason
- Future improvement: modify all kernel variants (PS, tkw1, tkw1-PS) so that the AITER selection heuristic can be restored

Test Plan

- 10-run bit-identical consistency test
- GSM8K accuracy benchmark
- TTFT latency benchmark
- Output stability test (`request_ali.py`)

Replace the BF16 atomic add with an INT32 atomic add in the MoE kernel to achieve bit-identical results across runs. Enable via: export AITER_MOE_INT32_ATOMIC=1

- Kernel: INT32 scale factor 32768, interleaved layout
- Python: INT32 post-processing + forced kernel selection
- Accuracy: 94.5% GSM8K (vs 94.6% BF16 baseline)
- Consistency: 10/10 bit-identical (131,072 elements)

Mortis-Huang wants to merge 1 commit into ROCm:dev/perf from Mortis-Huang:mortis/int32-atomic-moe
