ci(engine): stop measuring perf-gate wall time under parallel contention #52

Merged
diegokingston merged 1 commit into main from fix/ci-perf-gate-contention on Jun 10, 2026

Conversation

@diegokingston (Collaborator)

Root cause: harmonic_3d_5x5_plate_under_15s failed at 33.8s in the "Run all tests" step (843/6777), where it ran concurrently with the entire suite (sparse_shell_gates tests were reported SLOW [>120s]/[>240s] at the same moment) on a shared 2-4 vCPU runner. The gate asserts on wall-clock elapsed time, so CPU contention from co-scheduled tests inflated a ~9s debug solve (verified locally: 8.99s isolated/serial) to 33.8s. No fixed threshold can fix this: bumping 15s->30s only deferred the next flake.
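For readers unfamiliar with this kind of gate, the pattern can be sketched roughly as follows. This is a minimal illustration, not the code in engine/tests: `run_workload`, `time_wall_clock`, and `gate_passes` are hypothetical names, and the workload is a stand-in for the harmonic solve. The durations in `main` are the figures quoted above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the harmonic solve; the real test calls the engine.
fn run_workload() -> u64 {
    (0..1_000_000u64).fold(0, |acc, x| acc.wrapping_add(x * x))
}

/// Runs `work` once and returns its result together with the wall-clock time taken.
/// `Instant` measures elapsed real time, so any time the thread spends waiting
/// for a CPU while other tests run is included in the measurement.
fn time_wall_clock<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = work();
    (out, start.elapsed())
}

/// The gate itself: passes only if the elapsed wall time is under the limit.
fn gate_passes(elapsed: Duration, limit: Duration) -> bool {
    elapsed < limit
}

fn main() {
    let (_, elapsed) = time_wall_clock(run_workload);
    println!("stand-in workload took {:?}", elapsed);

    // Figures quoted in the PR description (assumed here, not re-measured).
    let limit = Duration::from_secs(15);
    let isolated = Duration::from_secs_f64(8.99);
    let contended = Duration::from_secs_f64(33.8);
    println!("isolated run passes:  {}", gate_passes(isolated, limit));
    println!("contended run passes: {}", gate_passes(contended, limit));
}
```

The same solve passes or fails depending only on what else the runner is executing, which is why the measurement has to be taken without co-scheduled tests.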

The repo already excludes two wall-time benchmarks (harmonic_phase_breakdown, harmonic_modal_vs_direct_timing) from the all() run for the same reason; the perf-gate binaries were missed.

Fix (workflow config, per the cd8beb9 convention):

  • Run the dedicated perf-gate steps single-threaded so elapsed time reflects CPU work, not contention.
  • Exclude perf_regression_advanced and perf_regression_gates from the all() run; they are fully validated in their dedicated steps, so re-running them under full-suite parallelism adds only flakiness.

Also bump the harmonic gate 15s->30s on main: serial debug was observed at 18.2s on a slow runner, which already exceeds the old 15s bound. The solver itself has not regressed.
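The headroom behind the new bound can be checked with a few lines of arithmetic. The sketch below uses only the serial timings quoted in this description; `headroom` is a hypothetical helper, not something in the repo.

```rust
/// Ratio of the gate limit to an observed time; a value below 1.0 means the gate fails.
fn headroom(limit_s: f64, observed_s: f64) -> f64 {
    limit_s / observed_s
}

fn main() {
    // Serial debug timings quoted in the PR: 8.99s locally, 18.2s on a slow runner.
    for observed in [8.99, 18.2] {
        for limit in [15.0, 30.0] {
            println!(
                "limit {limit}s, observed {observed}s: headroom {:.2}x",
                headroom(limit, observed)
            );
        }
    }
}
```

With the slow-runner figure, the 15s limit gives a ratio below 1, while 30s leaves a margin for runner-to-runner variance once the step runs single-threaded.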

diegokingston merged commit c4c6078 into main on Jun 10, 2026
2 of 3 checks passed
diegokingston added a commit that referenced this pull request Jun 10, 2026
…-product-iteration

Conflict in engine/tests/perf_regression_advanced.rs: both sides bumped
the harmonic gate 15s->30s with different doc comments; kept main's,
which documents the single-threaded CI execution.
diegokingston added a commit that referenced this pull request Jun 10, 2026
'Run criterion benchmarks (quick)' timed out at 10 minutes on a slow
shared runner and turned the test job red with no correctness signal —
'Run all tests' had passed. The step is informational (reports are
uploaded as artifacts, with if: always()), so mark it
continue-on-error and give it a bit more headroom, following the same
rationale as the perf-gate contention fix (#52).