Skip to content

CI: Detect FAILURE in halo exchange test if any#321

Open
leo-amd wants to merge 1 commit intomasterfrom
leo/peer-halo-exchange-test-propagate-errors
Open

CI: Detect FAILURE in halo exchange test if any#321
leo-amd wants to merge 1 commit intomasterfrom
leo/peer-halo-exchange-test-propagate-errors

Conversation

@leo-amd
Copy link
Collaborator

@leo-amd leo-amd commented Mar 19, 2026

To propagate the errors to the CI
Example of the run where we didn't find the failures and greened a run https://github.com/ROCm/apex/actions/runs/23252552706/job/67612063034?pr=320

@leo-amd leo-amd requested a review from jithunnair-amd March 19, 2026 15:47
export HSA_FORCE_FINE_GRAIN_PCIE=1
export HSA_ENABLE_SDMA=0
torchrun --nproc_per_node 8 apex/contrib/peer_memory/peer_halo_exchange_module_tests.py 2>&1 | tee halo_results.log
! grep -q 'FAILURE :' halo_results.log
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leo-amd @amd-sriram Whatever error detection logic we use should be applied to all the test runs. But that also begs the question why we need to do the above, and why doesn't it exit with nonzero exit code for halo tests? Is torchrun to blame?

Copy link
Collaborator

@amd-sriram amd-sriram Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jithunnair-amd @leo-amd We should try to use an assert statement similar to https://github.com/NVIDIA/apex/blob/master/apex/contrib/test/peer_memory/test_peer_halo_exchange_module.py#L134.

torch.testing.assert_close(list_y, list_y2, msg=memory_format_str)
I was trying to run the halo tests but I could only run it only once. So, couldn't check if the assert statement would help.

Copy link
Collaborator

@amd-sriram amd-sriram Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jithunnair-amd Made a PR with the assert statement and also addresses the timeout error - #323

Copy link
Collaborator

@jithunnair-amd jithunnair-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amd-sriram https://github.com/ROCm/apex/actions/runs/23303497615/job/67780836893?pr=321 runs the Halo exchange tests for 10-11h!!! That's not tenable at all. We need to reduce the runtime for these tests, or disable them in the meantime if a resolution is not straightforward.

@amd-sriram
Copy link
Collaborator

@jithunnair-amd @leo-amd We could check if the assert statement reduces the time taken for the halo test. If the time doesn't reduce, then we can disable it for the mean time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants