DO NOT MERGE: debug CUDA T4 max-reducer hang on PR #635 (Xid 31)#589

Draft
duburcqa wants to merge 3 commits into main from duburcqa/debug_metal_grad_repro

Conversation


@duburcqa duburcqa commented Apr 29, 2026

Debug-only PR. Reproduces and pinpoints the CUDA T4 failure observed on PR #635 (duburcqa/adstack_max_reducer_shader).

What we found in the previous tmate session

The failure reproduces deterministically on gpu-t4-4-core (Tesla T4, driver 590.48.01 open kernel module, HMM-capable) under pytest -n 8 --count 50 tests/python/test_ad_ndarray.py -k cuda. Roughly one in N workers crashes on its first or second launched test with:

[E cuda_driver.h:operator()@93] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS while calling cuStreamSynchronize
[E cuda_driver.h:operator()@93] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS while calling cuMemFreeAsync
[E cuda_driver.h:operator()@93] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS while calling cuMemsetD8_v2
terminate called after throwing an instance of 'std::__cxx11::basic_string'

dmesg confirms the underlying cause is a GPU MMU page fault on a kernel write:

NVRM: Xid (PCI:0001:00:00): 31, pid=37803, name=[pytest-xdist r, channel 0x00000201, intr 00000000.
  MMU Fault: ENGINE GRAPHICS GPC0 GPCCLIENT_T1_0 faulted @ 0x70a2_c8000000.
  Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
NVRM: Xid (PCI:0001:00:00): 109, pid=37803, ... CTX SWITCH TIMEOUT

A CUDA kernel writes to virtual address 0x70a2_c8000000 (and similar high addresses on subsequent fault occurrences), which is not mapped. After the fault, the GPU context hits Xid 109 CTX SWITCH TIMEOUT, and subsequent cuMemset / cuStreamSynchronize calls return CUDA_ERROR_ILLEGAL_ADDRESS because the context is poisoned. The host stack at the point where the symptom surfaces shows Ndarray::write_float -> Ndarray::write -> allocate_memory_unique (small staging buffer) -> cuMemset, which fails because the context was already corrupted by the earlier out-of-bounds write.

So the bug is real: some CUDA kernel writes out of bounds. The cuMemset illegal address we see in CI logs is a downstream symptom. We need compute-sanitizer to identify the offending kernel.
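
A minimal sketch of the sanitizer invocation that step 6 of the workflow below automates (the -x/-q flags are assumptions; --lf, the test path, and the sanitizer.log destination come from the workflow description):

  CUDA_LAUNCH_BLOCKING=1 compute-sanitizer --tool memcheck \
    python -m pytest --lf -x -q tests/python/test_ad_ndarray.py -k cuda \
    2>&1 | tee /tmp/repro/sanitizer.log
  # The first "Invalid __global__ write" block in the log names the
  # offending kernel and the faulting address.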

What this workflow does

A single-job workflow that lives on duburcqa/debug_metal_grad_repro. All other workflows on this branch are renamed to *.yml.disabled, so this is the only check that runs on PR sync.

  1. Build editable on the same gpu-t4-4-core runner that surfaces the bug. CUDA-only (-DQD_WITH_VULKAN=OFF -DQD_WITH_CUDA=ON -DQD_WITH_AMDGPU=OFF -DQD_BUILD_TESTS=OFF) so the build is fast. pip install -e . --no-build-isolation so the source tree and _skbuild/*/cmake-build/ survive into the tmate session: edit a .cpp, then cd _skbuild/*/cmake-build && cmake --build ., and the editable install picks up the new .so (see the rebuild loop sketched after this list).
  2. Pre-install nvidia-cuda-toolkit so compute-sanitizer is available.
  3. Capture system / driver / HMM info + a pre-test dmesg snapshot.
  4. Run the reproducer: pytest -n 8 --count 50 tests/python/test_ad_ndarray.py -k cuda, CUDA_LAUNCH_BLOCKING=1, --tb=long -ra -v. continue-on-error: true so the workflow proceeds even on (expected) failure.
  5. Snapshot dmesg again so the new Xid 31 / Xid 109 lines are visible in the artifact diff.
  6. Run compute-sanitizer --tool memcheck over pytest --lf (the previously failed tests). The first Invalid write block in the captured log identifies the offending kernel.
  7. Tmate session AFTER reproducer + sanitizer (blocking, limit-access-to-actor: true). When you attach, /tmp/repro/ already contains reproducer.log, sanitizer.log, sanitizer_summary.txt, xid_added.log, dmesg_*.log, system_identity.txt. The tmate session is for iterating on a fix, not for hunting the failure.
  8. Upload /tmp/repro/ as the debug_cuda_repro_artifacts artifact + final dmesg Xid count summary.
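
For reference, a sketch of the edit / rebuild / retest loop from inside the tmate session (the .cpp path is hypothetical; the helper scripts are the ones staged at /tmp/repro/, described below):

  $EDITOR src/ad/max_reducer.cpp                                # hypothetical path of the candidate fix
  (cd _skbuild/*/cmake-build && cmake --build . -j"$(nproc)")   # editable install picks up the rebuilt .so
  bash /tmp/repro/run_reproducer.sh 50                          # re-run the reproducer (--count 50)
  bash /tmp/repro/run_sanitizer.sh                              # re-check with compute-sanitizer over pytest --lf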

Inputs (workflow_dispatch)

  • quadrants_ref (default efd3f69abba15b574bf729ae11239d2a203a5f40, the HEAD of duburcqa/adstack_max_reducer_shader without the speculative fix).
  • tmate_timeout_minutes (default 60). touch /tmp/continue from inside the session to skip the rest of the wait.
  • reproduce_count (default 50, the --count value passed to pytest-repeat). A CLI dispatch example follows this list.
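
A hypothetical dispatch from the CLI, assuming the workflow file is named debug_cuda_repro.yml (the input names and defaults are the ones listed above):

  gh workflow run debug_cuda_repro.yml \
    --ref duburcqa/debug_metal_grad_repro \
    -f quadrants_ref=efd3f69abba15b574bf729ae11239d2a203a5f40 \
    -f tmate_timeout_minutes=60 \
    -f reproduce_count=50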

Inside the tmate session

Helpers staged at /tmp/repro/:

  • run_reproducer.sh [count] - re-run the reproducer.
  • run_sanitizer.sh - re-run compute-sanitizer over pytest --lf.
  • repro_cuda.py - standalone test_ad_fibonacci with faulthandler.dump_traceback_later. Useful for py-spy --native --locals from a second tmate window (sketched after this list).
  • README.md - the cheatsheet for iterating (edit cpp, cmake --build, retest).
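
A sketch of the py-spy flow from a second tmate window (the backgrounding, PID handling, and sleep are assumptions, not part of the staged helpers):

  python /tmp/repro/repro_cuda.py &     # standalone reproducer, faulthandler.dump_traceback_later armed
  REPRO_PID=$!
  sleep 5                               # let it reach the CUDA launch before sampling
  py-spy dump --pid "$REPRO_PID" --native --locals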

Do not merge

Once the offending kernel is identified and fixed on PR #635, this branch and its workflow file will be deleted.

@duburcqa duburcqa force-pushed the duburcqa/debug_metal_grad_repro branch 5 times, most recently from 3dc96dd to 73990be on May 1, 2026 at 10:37
@duburcqa duburcqa closed this May 1, 2026
@duburcqa duburcqa reopened this May 6, 2026
@duburcqa duburcqa force-pushed the duburcqa/debug_metal_grad_repro branch 3 times, most recently from 9d1e436 to 27caaf9 on May 7, 2026 at 09:30
@duburcqa duburcqa changed the title DO NOT MERGE: debug auto-diff on Apple M1. DO NOT MERGE: debug CUDA T4 max-reducer hang on PR #635 May 7, 2026
@duburcqa duburcqa force-pushed the duburcqa/debug_metal_grad_repro branch 2 times, most recently from 4224316 to dbfbcef on May 7, 2026 at 10:57
@duburcqa duburcqa changed the title DO NOT MERGE: debug CUDA T4 max-reducer hang on PR #635 DO NOT MERGE: debug CUDA T4 max-reducer hang on PR #635 (Xid 31) May 7, 2026
@duburcqa duburcqa force-pushed the duburcqa/debug_metal_grad_repro branch 2 times, most recently from 11e66fa to fda6a24 on May 7, 2026 at 12:32
…ranch, -s on sanitizer pytest, FAILED-streaming abort wrapper, tmate detached BEFORE reproducer, drop output-truncating head/tail pipes
@duburcqa duburcqa force-pushed the duburcqa/debug_metal_grad_repro branch from fda6a24 to 2151797 on May 7, 2026 at 13:06