CI: Detect FAILURE in halo exchange test if any#321
Conversation
| export HSA_FORCE_FINE_GRAIN_PCIE=1 | ||
| export HSA_ENABLE_SDMA=0 | ||
| torchrun --nproc_per_node 8 apex/contrib/peer_memory/peer_halo_exchange_module_tests.py 2>&1 | tee halo_results.log | ||
| ! grep -q 'FAILURE :' halo_results.log |
There was a problem hiding this comment.
@leo-amd @amd-sriram Whatever error detection logic we use should be applied to all the test runs. But that also begs the question why we need to do the above, and why doesn't it exit with nonzero exit code for halo tests? Is torchrun to blame?
There was a problem hiding this comment.
@jithunnair-amd @leo-amd We should try to use an assert statement similar to https://github.com/NVIDIA/apex/blob/master/apex/contrib/test/peer_memory/test_peer_halo_exchange_module.py#L134.
torch.testing.assert_close(list_y, list_y2, msg=memory_format_str)
I was trying to run the halo tests but I could only run it only once. So, couldn't check if the assert statement would help.
There was a problem hiding this comment.
@jithunnair-amd Made a PR with the assert statement and also addresses the timeout error - #323
jithunnair-amd
left a comment
There was a problem hiding this comment.
@amd-sriram https://github.com/ROCm/apex/actions/runs/23303497615/job/67780836893?pr=321 runs the Halo exchange tests for 10-11h!!! That's not tenable at all. We need to reduce the runtime for these tests, or disable them in the meantime if a resolution is not straightforward.
|
@jithunnair-amd @leo-amd We could check if the assert statement reduces the time taken for the halo test. If the time doesn't reduce, then we can disable it for the mean time. |
To propagate the errors to the CI
Example of the run where we didn't find the failures and greened a run https://github.com/ROCm/apex/actions/runs/23252552706/job/67612063034?pr=320