CI: Detect FAILURE in halo exchange test if any by leo-amd · Pull Request #321 · ROCm/apex

leo-amd · 2026-03-19T15:47:00Z

This change propagates errors from the halo exchange test to the CI.
Example of a run where the failures went undetected and the job was marked green: https://github.com/ROCm/apex/actions/runs/23252552706/job/67612063034?pr=320

jithunnair-amd · 2026-03-23T18:00:41Z

.github/workflows/rocm-ci.yml

            export HSA_FORCE_FINE_GRAIN_PCIE=1
            export HSA_ENABLE_SDMA=0
            torchrun --nproc_per_node 8 apex/contrib/peer_memory/peer_halo_exchange_module_tests.py 2>&1 | tee halo_results.log
+            ! grep -q 'FAILURE :' halo_results.log


@leo-amd @amd-sriram Whatever error detection logic we use should be applied to all the test runs. But that also raises the question of why we need to do the above: why doesn't the halo test run exit with a nonzero exit code? Is torchrun to blame?
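One possible contributor worth ruling out (an assumption, not something confirmed in this thread) is the `| tee` in the command itself: in bash, a pipeline's exit status is that of its last command unless `pipefail` is set, so a failing left-hand command can be masked by a successful `tee`. A minimal sketch of that behaviour, where `false` stands in for a failing test command and `cat` stands in for `tee`; the helper name `pipeline_status` is hypothetical:

```python
import subprocess


def pipeline_status(script: str) -> int:
    """Run a bash snippet and return its exit status."""
    return subprocess.run(["bash", "-c", script]).returncode


# Without pipefail, the pipeline reports the status of the last command (cat).
masked = pipeline_status("false | cat")

# With pipefail, a failure anywhere in the pipeline is reported.
propagated = pipeline_status("set -o pipefail; false | cat")

print(masked, propagated)  # prints "0 1"
```

Whether this applies here depends on which shell options the workflow step runs with, and it would not help if the test script itself exits with 0 after printing a failure.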

@jithunnair-amd @leo-amd We should try to use an assert statement similar to https://github.com/NVIDIA/apex/blob/master/apex/contrib/test/peer_memory/test_peer_halo_exchange_module.py#L134.

torch.testing.assert_close(list_y, list_y2, msg=memory_format_str)
I tried to run the halo tests but could only run them once, so I couldn't check whether the assert statement would help.
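To illustrate why an assert could help, here is a rough sketch contrasting a check that only prints a `FAILURE :` line with one that raises. It is plain Python rather than the actual test code: `report_only` and `assert_style` are hypothetical helpers, the `SUCCESS :` wording is made up, and `assert_style` only plays the role that `torch.testing.assert_close` would play in the real test.

```python
import math


def report_only(actual, expected, label):
    """Print-style check: logs a FAILURE line but never raises."""
    ok = len(actual) == len(expected) and all(
        math.isclose(a, e, rel_tol=1e-5) for a, e in zip(actual, expected)
    )
    print(f"{'SUCCESS' if ok else 'FAILURE'} : {label}")
    return ok


def assert_style(actual, expected, label):
    """Assert-style check: raises AssertionError on the first mismatch."""
    if len(actual) != len(expected):
        raise AssertionError(f"{label}: length {len(actual)} != {len(expected)}")
    for i, (a, e) in enumerate(zip(actual, expected)):
        if not math.isclose(a, e, rel_tol=1e-5):
            raise AssertionError(f"{label}: mismatch at index {i}: {a} != {e}")


# The print-style check keeps going, so the script can still exit with 0.
report_only([1.0, 2.0], [1.0, 2.5], "channels_last")

# The assert-style check raises; left uncaught, Python exits with a non-zero status.
try:
    assert_style([1.0, 2.0], [1.0, 2.5], "channels_last")
except AssertionError as err:
    print(f"raised: {err}")
```

With the raising variant, an uncaught error in a worker ends that process with a non-zero status, so the CI would not need to grep the log.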

@jithunnair-amd I made a PR that adds the assert statement and also addresses the timeout error: #323

jithunnair-amd

@amd-sriram https://github.com/ROCm/apex/actions/runs/23303497615/job/67780836893?pr=321 runs the Halo exchange tests for 10-11h!!! That's not tenable at all. We need to reduce the runtime for these tests, or disable them in the meantime if a resolution is not straightforward.

amd-sriram · 2026-03-24T18:24:50Z

@jithunnair-amd @leo-amd We could check whether the assert statement reduces the time taken by the halo test. If it doesn't, we can disable the test in the meantime.
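As a stopgap while the runtime is investigated, one option (a sketch, not something proposed in this PR) is to run the test under a time budget so a hang fails fast instead of occupying the runner for hours. `run_with_budget` is a hypothetical helper, and the 124 return value simply mirrors the exit code GNU `timeout` uses:

```python
import subprocess
import sys


def run_with_budget(cmd, budget_seconds):
    """Run cmd; return its exit code, or 124 if it exceeds the time budget."""
    try:
        return subprocess.run(cmd, timeout=budget_seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising TimeoutExpired.
        return 124


# Stand-in for a long-running test: sleeps well past the budget.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_budget(slow, budget_seconds=0.5))
```

Note that only the direct child is killed on timeout; a launcher that spawns worker processes may need extra cleanup.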

Find errors in halo exchange test if any

bc7f81d

leo-amd requested a review from jithunnair-amd March 19, 2026 15:47

jithunnair-amd reviewed Mar 23, 2026

leo-amd wants to merge 1 commit into master from leo/peer-halo-exchange-test-propagate-errors
