Skip to content

fix(multi_review): pre-compute diff so reviewers can't hallucinate clone failures#30

Merged
samestrin merged 3 commits into
mainfrom
fix/multi-review-diff-precompute
May 13, 2026
Merged

fix(multi_review): pre-compute diff so reviewers can't hallucinate clone failures#30
samestrin merged 3 commits into
mainfrom
fix/multi-review-diff-precompute

Conversation

@samestrin
Copy link
Copy Markdown
Owner

Summary

Real-world use against a 5500-line sprint surfaced two failures in the multi-reviewer flow shipped in #29:

  1. Hallucinated infrastructure failures. Bruce (qwen-3.6-plus) replied status: error, "Repository clone missing" without ever running ls. Direct openclaw probe confirmed the clone was present and readable — the reviewer fabricated a failure rather than persist through the multi-step cd <path> && git diff <base>..<head> instruction.
  2. 600s timeout too tight. Greta and Dax exceeded the default on the same sprint; 1200s aligns with the existing total-timeout default.

Fix

Move diff visibility from "reviewer runs git diff" to "orchestrator pre-computes once, reviewer runs cat":

  • New multireview.PreComputeDiff helper SSHes to the remote and runs git -C <repo> diff <base>..<head> > <work>/diff.txt && wc -c && wc -l. Returns path, size, line count.
  • runMultiReview calls preComputeDiffFn after shipBundleFn; any failure hard-stops before any reviewer is invoked.
  • --base is now required (working-tree fallback removed — observed to fail).
  • Task message rewritten: tells reviewer the exact diff path, the size, and includes an anti-hallucination clause ("Do NOT report 'repository missing' / 'clone failed' unless ls <diffPath> actually shows the file is absent"). Large diffs (>1MB) get an appended --stat hint.
  • --per-reviewer-timeout-seconds default bumped 600 → 1200.

Test plan

  • go test ./internal/support/multireview/... — 9 cases for PreComputeDiff (success, required inputs, bad ref, empty diff, large diff, BSD/GNU wc, timeout, HEAD default). 83% coverage on the new file.
  • go test ./internal/support/commands/... — 5 new cases (diff-failure hard-stop, empty-base rejection, diff path in task message, large-diff warning, small-diff no-warning) plus 4 existing tests updated to pass --base.
  • All subprocess calls mocked via execCommandContext swap — no real SSH, no real openclaw, no campaigns.
  • Manual smoke test post-merge against yamtorah.estr.in/sprint/1.1.2-I_tier_c_classifier_microservice with --reviewers bruce,greta,kai,mira,dax,otto --base 10054c8 --head 9326759. Pass criteria: 5+/6 reviewers return status: ok with tdLineCount > 0; total findings > 30; no reviewer reply contains "clone missing" / "directory doesn't exist" / "refs not found".

Files changed

New

  • internal/support/multireview/diff.go
  • internal/support/multireview/diff_test.go

Modified

  • internal/support/commands/multi_review.go + test
  • docs/llm-support-commands.md
  • CHANGELOG.md

samestrin added 3 commits May 13, 2026 12:33
Reviewer agents in the multi_review fan-out were hallucinating "clone
missing" / "ref not found" failures rather than running `git diff
<base>..<head>` themselves. Pre-computing the diff once on the remote
and pointing reviewers at the resulting file is the durable fix.

Adds PreComputeDiff(host, repo, workdir, base, head) which SSHs to the
remote, runs `git -C <repo> diff <base>..<head> > <work>/diff.txt`
followed by `wc -c` and `wc -l` for size/line stats, and returns a
PreComputeDiffResult with the diff path and counts.

Failure semantics:
  - Missing inputs (especially BaseRef — working-tree mode is intentionally
    unsupported) → validation error, no SSH call
  - Non-zero git exit (bad ref) → wraps stderr in error so the orchestrator
    can hard-stop with a clear message
  - Context deadline → propagated
  - Zero-byte diff → result.Empty=true, no error; caller decides

Parser tolerates both BSD wc output ("  1234 path") and GNU wc output
("1234 path") via first-numeric-token extraction. Both formats explicitly
covered by unit tests.

Coverage: 83% of statements on the new file (full file path coverage on
the happy path + 7 edge cases). No real SSH or openclaw calls — all
subprocess invocations mocked via the existing execCommandContext seam.
… to 1200s

Three changes to runMultiReview, one new prompt body, one default change,
plus the tests that drive them all.

1. preComputeDiffFn indirection added alongside shipBundleFn /
   invokeReviewerFn. Production points at multireview.PreComputeDiff;
   tests swap via withMockPreComputeDiff helper.

2. runMultiReview now calls preComputeDiffFn after shipBundleFn returns.
   Failure hard-stops the run before any reviewer is invoked. Verified
   by TestMultiReview_DiffFailureHardStops which sets an invokeReviewerFn
   that fails loud if called.

3. --base is now required. Working-tree mode is intentionally unsupported
   — it was the root cause of the failure in production where reviewers
   hallucinated "clone missing" rather than navigating the working tree.
   Verified by TestMultiReview_EmptyBaseRefRejected which asserts
   shipBundleFn is never reached.

4. buildDefaultTaskMessage rewritten to accept PreComputeDiffResult and
   produce a prompt that:
   - Points reviewers at the pre-computed diff path
   - Tells them to `cat <path>` (one trivial exec call, hard to hallucinate around)
   - Includes an anti-hallucination clause: "do NOT report 'clone failed'
     unless `ls <path>` proves it"
   - Appends a `git diff --stat` hint when SizeBytes > 1_000_000

5. --per-reviewer-timeout-seconds default bumped 600 → 1200. Production
   observation: Greta and Dax timed out at 600s on a real sprint. 1200s
   matches the existing --timeout-seconds (total) default.

All four existing TestMultiReview_* tests updated to pass --base v1
and call withMockPreComputeDiff. New tests:
- TestMultiReview_DiffFailureHardStops
- TestMultiReview_EmptyBaseRefRejected
- TestMultiReview_DiffPathInTaskMessage
- TestMultiReview_LargeDiffWarning
- TestMultiReview_SmallDiffNoWarning

No real openclaw / SSH calls in any test path — all subprocess execution
goes through the existing execCommandContext mock seam.
…1200s timeout

CLI long-help string, command reference, and CHANGELOG entry all updated
to reflect the diff-precompute fix:

- multi_review --help: explains the pre-compute step and its rationale
  ("weaker reviewers hallucinate failures rather than persist"), lists
  --base as REQUIRED, and adds the new failure mode (diff-precompute
  hard-stop) to the failure-semantics block

- --base flag help: marks it REQUIRED and explains it drives the
  pre-compute, not just informational task-message decoration

- llm-support-commands.md: moves --base from optional to required flag
  table, adds a dedicated "Pre-compute step" section explaining the
  SSH command that runs, the file location, and the >1MB --stat hint
  behavior. Bumps the --per-reviewer-timeout-seconds default from 600
  to 1200 in the docs to match the code change.

- CHANGELOG: adds a Fixed entry under Unreleased explaining what
  changed and why (production observation, anti-hallucination
  guard rails, required --base rationale, timeout bump).
@samestrin samestrin merged commit 51968c6 into main May 13, 2026
8 checks passed
@samestrin samestrin deleted the fix/multi-review-diff-precompute branch May 13, 2026 20:02
samestrin added a commit that referenced this pull request May 13, 2026
…#32)

* feat(multireview): ContainerExec helper for running commands inside the gateway container

Wraps:  ssh <host> -- docker exec <c> sh -c "$(echo <b64> | base64 -d)"

The base64 round-trip removes all quoting concerns for the caller's
command. Mirrors SSHRun semantics: non-zero exit is data, not error;
returned error means the call could not be executed.

8 unit tests via the execCommandContext swap. No real SSH, no real
docker. 100% coverage of the new function.

Needed because the openclaw-gateway container has no /tmp bind mount,
so prior runs (PR #29, PR #30) wrote bundles and diffs to host /tmp
where reviewers running inside the container could never see them.
Subsequent commits will route ShipBundle and PreComputeDiff through
this helper.

* feat(multireview): ship bundle into container via staging + docker cp

The openclaw-gateway container has no /tmp bind mount, so the previous
sequence (mkdir + scp + clone, all on host) created files reviewers
could never see. New sequence:

  1. ssh -- docker exec mkdir <container workdir>
  2. ssh -- mkdir <host staging dir>
  3. scp bundle.git -> host staging
  4. ssh -- docker cp <host staging>/bundle.git <container>:<workdir>/
  5. ssh -- docker exec git clone (inside container workdir)

ShipBundleParams gains:
- GatewayContainer (defaults to openclaw-gateway)
- HostStagingDir (required; scp landing pad)

ShipBundleResult.RemoteRepoPath now returns a container-local path.
RemoteBundlePath was renamed to HostStagingBundlePath since the bundle
on disk lives on the host, not the container.

7 bundle tests verify the 5-call sequence, container defaults,
field-required validations, and per-step failure modes
(container-mkdir fail, docker-cp fail, container-clone fail). The
RemoteRepoPath-is-container-path invariant is asserted directly.

multi_review_test.go mock updated to the renamed field. All existing
tests still pass.

* feat(multireview): route PreComputeDiff through the gateway container

git diff + wc now run inside the openclaw-gateway container via
ContainerExec, so the resulting diff.txt lands in the container's
overlay /tmp where reviewers can actually see it. Writing to the
host's /tmp (PR #30's approach) put the file in a filesystem the
container couldn't read.

PreComputeDiffParams gains GatewayContainer (defaults to
"openclaw-gateway"). Inner pipeline shape and wc parsing are
unchanged.

2 new tests:
- TestPreComputeDiff_RoutedThroughContainer: outer ssh contains
  'docker exec', container name, and 'base64 -d'; decoded inner
  is still 'git ... diff > diff.txt && wc -c && wc -l'
- TestPreComputeDiff_DefaultsContainerName: empty GatewayContainer
  defaults via ContainerExec

Existing tests updated: HeadRefDefaultsToHEAD now decodes the
base64 inner payload to find the quoted 'HEAD' token (the outer
ssh args no longer contain it in clear text).

Coverage 91.3% on PreComputeDiff.

* feat(multi_review): thread gateway-container through, split cleanup in two

runMultiReview now passes:
- GatewayContainer into BOTH ShipBundleParams and PreComputeDiffParams
- HostStagingDir into ShipBundleParams as /tmp/multi-review-staging-<ts>

Cleanup is now two defers (LIFO):
1. ContainerExec rm -rf <container workdir>  via containerExecFn
2. SSHRun        rm -rf <host staging dir>   via sshRunFn

Both honor --skip-cleanup. The new sshRunFn and containerExecFn
package vars let tests assert on cleanup commands deterministically.

4 new tests:
- GatewayContainerFlagsThreaded: --gateway-container value reaches both
  ShipBundle and PreComputeDiff (different default name verifies the
  thread isn't just defaulting both ends)
- HostStagingDirThreadsToShipBundle: HostStagingDir is set, has the
  /tmp/multi-review-staging- prefix, and differs from RemoteWorkdir
- BothCleanupsFire: on success, container rm -rf <workdir> and host
  rm -rf <staging> both run
- SkipCleanupSkipsBoth: --skip-cleanup suppresses both channels

Coverage on runMultiReview: 94.3%.

* docs(multi_review): document container-routed filesystem ops + CHANGELOG entry

docs/llm-support-commands.md: replaces the pre-compute-step section
with a 6-step container-routed pipeline (mkdir container + mkdir host
staging + scp + docker cp + git clone in container + diff pre-compute
via docker exec). Adds failure-semantics rows for the new container
failure modes.

CHANGELOG.md: new Unreleased fix entry under llm-support describing
the host/container filesystem boundary bug and the ContainerExec /
ShipBundle / PreComputeDiff refactor. Includes the breaking rename
of ShipBundleResult.RemoteBundlePath -> HostStagingBundlePath.

* fix(multi_review): register cleanup defers before ShipBundle

Adversarial review surfaced a pre-existing (PR #29) bug: if ShipBundle
fails partway (e.g. container mkdir succeeds, then docker cp fails),
the container workdir leaks because cleanup defers were registered
only AFTER ShipBundle returned success.

Fix: register both cleanup defers BEFORE the ShipBundle call. rm -rf
on a path that was never created is a harmless no-op, so this is safe.

New test TestMultiReview_CleanupFiresOnShipBundleFailure verifies
both cleanup channels fire even when ShipBundle returns an error.

Also cleaned up dead code in TestShipBundle_ContainerCloneFails (the
test had an unused first execCommandContext mock immediately
overwritten by a second). And tightened the --help text to mention
container routing and clarify the diff lives inside the container,
not on the remote host.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant