DO NOT MERGE: debug CUDA T4 max-reducer hang on PR #635 (Xid 31)#589

Draft
duburcqa wants to merge 3 commits into main from duburcqa/debug_metal_grad_repro

Conversation


@duburcqa duburcqa commented Apr 29, 2026

Debug-only PR. Reproduces and pinpoints the CUDA T4 failure observed on PR #635 (duburcqa/adstack_max_reducer_shader).

What we found in the previous tmate session

The failure reproduces deterministically on gpu-t4-4-core (Tesla T4, driver 590.48.01 open kernel module, HMM-capable) under pytest -n 8 --count 50 tests/python/test_ad_ndarray.py -k cuda. Roughly one in N workers crashes on its first or second launched test with:

[E cuda_driver.h:operator()@93] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS while calling cuStreamSynchronize
[E cuda_driver.h:operator()@93] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS while calling cuMemFreeAsync
[E cuda_driver.h:operator()@93] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS while calling cuMemsetD8_v2
terminate called after throwing an instance of 'std::__cxx11::basic_string'

dmesg confirms the underlying cause is a GPU MMU page fault on a kernel write:

NVRM: Xid (PCI:0001:00:00): 31, pid=37803, name=[pytest-xdist r, channel 0x00000201, intr 00000000.
  MMU Fault: ENGINE GRAPHICS GPC0 GPCCLIENT_T1_0 faulted @ 0x70a2_c8000000.
  Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
NVRM: Xid (PCI:0001:00:00): 109, pid=37803, ... CTX SWITCH TIMEOUT

A CUDA kernel writes to virtual address 0x70a2_c8000000 (and similar high addresses on subsequent fault occurrences), which is not mapped. After the fault, the GPU context hits Xid 109 CTX SWITCH TIMEOUT, and subsequent cuMemset / cuStreamSynchronize calls return CUDA_ERROR_ILLEGAL_ADDRESS because the context is poisoned. The host stack at the point where the symptom surfaces shows Ndarray::write_float -> Ndarray::write -> allocate_memory_unique (small staging buffer) -> cuMemset, which fails because the context was already corrupted by the earlier out-of-bounds write.

So the bug is real: some CUDA kernel writes out of bounds. The cuMemset illegal address we see in CI logs is a downstream symptom. We need compute-sanitizer to identify the offending kernel.
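
A minimal sketch of the sanitizer invocation that step 6 of the workflow below automates (the -x/-q flags are assumptions; --lf, the test path, and the sanitizer.log destination come from the workflow description):

  CUDA_LAUNCH_BLOCKING=1 compute-sanitizer --tool memcheck \
    python -m pytest --lf -x -q tests/python/test_ad_ndarray.py -k cuda \
    2>&1 | tee /tmp/repro/sanitizer.log
  # The first "Invalid __global__ write" block in the log names the
  # offending kernel and the faulting address.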

What this workflow does

A single-job workflow that lives on duburcqa/debug_metal_grad_repro. All other workflows on this branch are renamed to *.yml.disabled, so this is the only check that runs on PR sync.

  1. Build editable on the same gpu-t4-4-core runner that surfaces the bug. CUDA-only (-DQD_WITH_VULKAN=OFF -DQD_WITH_CUDA=ON -DQD_WITH_AMDGPU=OFF -DQD_BUILD_TESTS=OFF) so the build is fast. pip install -e . --no-build-isolation so the source tree and _skbuild/*/cmake-build/ survive into the tmate session: edit a .cpp, then cd _skbuild/*/cmake-build && cmake --build ., and the editable install picks up the new .so (see the rebuild loop sketched after this list).
  2. Pre-install nvidia-cuda-toolkit so compute-sanitizer is available.
  3. Capture system / driver / HMM info + a pre-test dmesg snapshot.
  4. Run the reproducer: pytest -n 8 --count 50 tests/python/test_ad_ndarray.py -k cuda, CUDA_LAUNCH_BLOCKING=1, --tb=long -ra -v. continue-on-error: true so the workflow proceeds even on (expected) failure.
  5. Snapshot dmesg again so the new Xid 31 / Xid 109 lines are visible in the artifact diff.
  6. Run compute-sanitizer --tool memcheck over pytest --lf (the previously failed tests). The first Invalid write block in the captured log identifies the offending kernel.
  7. Tmate session AFTER reproducer + sanitizer (blocking, limit-access-to-actor: true). When you attach, /tmp/repro/ already contains reproducer.log, sanitizer.log, sanitizer_summary.txt, xid_added.log, dmesg_*.log, system_identity.txt. The tmate session is for iterating on a fix, not for hunting the failure.
  8. Upload /tmp/repro/ as the debug_cuda_repro_artifacts artifact + final dmesg Xid count summary.
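
For reference, a sketch of the edit / rebuild / retest loop from inside the tmate session (the .cpp path is hypothetical; the helper scripts are the ones staged at /tmp/repro/, described below):

  $EDITOR src/ad/max_reducer.cpp                                # hypothetical path of the candidate fix
  (cd _skbuild/*/cmake-build && cmake --build . -j"$(nproc)")   # editable install picks up the rebuilt .so
  bash /tmp/repro/run_reproducer.sh 50                          # re-run the reproducer (--count 50)
  bash /tmp/repro/run_sanitizer.sh                              # re-check with compute-sanitizer over pytest --lf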

Inputs (workflow_dispatch)

  • quadrants_ref (default efd3f69abba15b574bf729ae11239d2a203a5f40, the HEAD of duburcqa/adstack_max_reducer_shader without the speculative fix).
  • tmate_timeout_minutes (default 60). touch /tmp/continue from inside the session to skip the rest of the wait.
  • reproduce_count (default 50, the --count value passed to pytest-repeat). A CLI dispatch example follows this list.
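
A hypothetical dispatch from the CLI, assuming the workflow file is named debug_cuda_repro.yml (the input names and defaults are the ones listed above):

  gh workflow run debug_cuda_repro.yml \
    --ref duburcqa/debug_metal_grad_repro \
    -f quadrants_ref=efd3f69abba15b574bf729ae11239d2a203a5f40 \
    -f tmate_timeout_minutes=60 \
    -f reproduce_count=50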

Inside the tmate session

Helpers staged at /tmp/repro/:

  • run_reproducer.sh [count] - re-run the reproducer.
  • run_sanitizer.sh - re-run compute-sanitizer over pytest --lf.
  • repro_cuda.py - standalone test_ad_fibonacci with faulthandler.dump_traceback_later. Useful for py-spy --native --locals from a second tmate window (sketched after this list).
  • README.md - the cheatsheet for iterating (edit cpp, cmake --build, retest).
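
A sketch of the py-spy flow from a second tmate window (the backgrounding, PID handling, and sleep are assumptions, not part of the staged helpers):

  python /tmp/repro/repro_cuda.py &     # standalone reproducer, faulthandler.dump_traceback_later armed
  REPRO_PID=$!
  sleep 5                               # let it reach the CUDA launch before sampling
  py-spy dump --pid "$REPRO_PID" --native --locals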

Do not merge

Once the offending kernel is identified and fixed on PR #635, this branch and its workflow file will be deleted.

@duburcqa duburcqa force-pushed the duburcqa/debug_metal_grad_repro branch 5 times, most recently from 3dc96dd to 73990be on May 1, 2026 at 10:37
@duburcqa duburcqa closed this May 1, 2026
@duburcqa duburcqa reopened this May 6, 2026
@duburcqa duburcqa force-pushed the duburcqa/debug_metal_grad_repro branch 3 times, most recently from 9d1e436 to 27caaf9 on May 7, 2026 at 09:30
@duburcqa duburcqa changed the title DO NOT MERGE: debug auto-diff on Apple M1. DO NOT MERGE: debug CUDA T4 max-reducer hang on PR #635 May 7, 2026
@duburcqa duburcqa force-pushed the duburcqa/debug_metal_grad_repro branch 2 times, most recently from 4224316 to dbfbcef on May 7, 2026 at 10:57
@duburcqa duburcqa changed the title DO NOT MERGE: debug CUDA T4 max-reducer hang on PR #635 DO NOT MERGE: debug CUDA T4 max-reducer hang on PR #635 (Xid 31) May 7, 2026
@duburcqa duburcqa force-pushed the duburcqa/debug_metal_grad_repro branch 2 times, most recently from 11e66fa to fda6a24 on May 7, 2026 at 12:32
…ranch, -s on sanitizer pytest, FAILED-streaming abort wrapper, tmate detached BEFORE reproducer, drop output-truncating head/tail pipes
@duburcqa duburcqa force-pushed the duburcqa/debug_metal_grad_repro branch from fda6a24 to 2151797 on May 7, 2026 at 13:06