fix(multi_review): pre-compute diff so reviewers can't hallucinate clone failures by samestrin · Pull Request #30 · samestrin/llm-tools

samestrin · 2026-05-13T19:44:54Z

Summary

Real-world use against a 5500-line sprint surfaced two failures in the multi-reviewer flow shipped in #29:

Hallucinated infrastructure failures. Bruce (qwen-3.6-plus) replied status: error, "Repository clone missing" without ever running ls. Direct openclaw probe confirmed the clone was present and readable — the reviewer fabricated a failure rather than persist through the multi-step cd <path> && git diff <base>..<head> instruction.
600s timeout too tight. Greta and Dax exceeded the default on the same sprint; 1200s aligns with the existing total-timeout default.

Fix

Move diff visibility from "reviewer runs git diff" to "orchestrator pre-computes once, reviewer runs cat":

New multireview.PreComputeDiff helper SSHes to the remote and runs git -C <repo> diff <base>..<head> > <work>/diff.txt && wc -c && wc -l. Returns path, size, line count.
runMultiReview calls preComputeDiffFn after shipBundleFn; any failure hard-stops before any reviewer is invoked.
--base is now required (working-tree fallback removed — observed to fail).
Task message rewritten: tells reviewer the exact diff path, the size, and includes an anti-hallucination clause ("Do NOT report 'repository missing' / 'clone failed' unless ls <diffPath> actually shows the file is absent"). Large diffs (>1MB) get an appended --stat hint.
--per-reviewer-timeout-seconds default bumped 600 → 1200.

Test plan

go test ./internal/support/multireview/... — 9 cases for PreComputeDiff (success, required inputs, bad ref, empty diff, large diff, BSD/GNU wc, timeout, HEAD default). 83% coverage on the new file.
go test ./internal/support/commands/... — 5 new cases (diff-failure hard-stop, empty-base rejection, diff path in task message, large-diff warning, small-diff no-warning) plus 4 existing tests updated to pass --base.
All subprocess calls mocked via execCommandContext swap — no real SSH, no real openclaw, no campaigns.
Manual smoke test post-merge against yamtorah.estr.in/sprint/1.1.2-I_tier_c_classifier_microservice with --reviewers bruce,greta,kai,mira,dax,otto --base 10054c8 --head 9326759. Pass criteria: 5+/6 reviewers return status: ok with tdLineCount > 0; total findings > 30; no reviewer reply contains "clone missing" / "directory doesn't exist" / "refs not found".

Files changed

New

internal/support/multireview/diff.go
internal/support/multireview/diff_test.go

Modified

internal/support/commands/multi_review.go + test
docs/llm-support-commands.md
CHANGELOG.md

Reviewer agents in the multi_review fan-out were hallucinating "clone missing" / "ref not found" failures rather than running `git diff <base>..<head>` themselves. Pre-computing the diff once on the remote and pointing reviewers at the resulting file is the durable fix. Adds PreComputeDiff(host, repo, workdir, base, head) which SSHs to the remote, runs `git -C <repo> diff <base>..<head> > <work>/diff.txt` followed by `wc -c` and `wc -l` for size/line stats, and returns a PreComputeDiffResult with the diff path and counts. Failure semantics: - Missing inputs (especially BaseRef — working-tree mode is intentionally unsupported) → validation error, no SSH call - Non-zero git exit (bad ref) → wraps stderr in error so the orchestrator can hard-stop with a clear message - Context deadline → propagated - Zero-byte diff → result.Empty=true, no error; caller decides Parser tolerates both BSD wc output (" 1234 path") and GNU wc output ("1234 path") via first-numeric-token extraction. Both formats explicitly covered by unit tests. Coverage: 83% of statements on the new file (full file path coverage on the happy path + 7 edge cases). No real SSH or openclaw calls — all subprocess invocations mocked via the existing execCommandContext seam.

… to 1200s Three changes to runMultiReview, one new prompt body, one default change, plus the tests that drive them all. 1. preComputeDiffFn indirection added alongside shipBundleFn / invokeReviewerFn. Production points at multireview.PreComputeDiff; tests swap via withMockPreComputeDiff helper. 2. runMultiReview now calls preComputeDiffFn after shipBundleFn returns. Failure hard-stops the run before any reviewer is invoked. Verified by TestMultiReview_DiffFailureHardStops which sets an invokeReviewerFn that fails loud if called. 3. --base is now required. Working-tree mode is intentionally unsupported — it was the root cause of the failure in production where reviewers hallucinated "clone missing" rather than navigating the working tree. Verified by TestMultiReview_EmptyBaseRefRejected which asserts shipBundleFn is never reached. 4. buildDefaultTaskMessage rewritten to accept PreComputeDiffResult and produce a prompt that: - Points reviewers at the pre-computed diff path - Tells them to `cat <path>` (one trivial exec call, hard to hallucinate around) - Includes an anti-hallucination clause: "do NOT report 'clone failed' unless `ls <path>` proves it" - Appends a `git diff --stat` hint when SizeBytes > 1_000_000 5. --per-reviewer-timeout-seconds default bumped 600 → 1200. Production observation: Greta and Dax timed out at 600s on a real sprint. 1200s matches the existing --timeout-seconds (total) default. All four existing TestMultiReview_* tests updated to pass --base v1 and call withMockPreComputeDiff. New tests: - TestMultiReview_DiffFailureHardStops - TestMultiReview_EmptyBaseRefRejected - TestMultiReview_DiffPathInTaskMessage - TestMultiReview_LargeDiffWarning - TestMultiReview_SmallDiffNoWarning No real openclaw / SSH calls in any test path — all subprocess execution goes through the existing execCommandContext mock seam.

…1200s timeout CLI long-help string, command reference, and CHANGELOG entry all updated to reflect the diff-precompute fix: - multi_review --help: explains the pre-compute step and its rationale ("weaker reviewers hallucinate failures rather than persist"), lists --base as REQUIRED, and adds the new failure mode (diff-precompute hard-stop) to the failure-semantics block - --base flag help: marks it REQUIRED and explains it drives the pre-compute, not just informational task-message decoration - llm-support-commands.md: moves --base from optional to required flag table, adds a dedicated "Pre-compute step" section explaining the SSH command that runs, the file location, and the >1MB --stat hint behavior. Bumps the --per-reviewer-timeout-seconds default from 600 to 1200 in the docs to match the code change. - CHANGELOG: adds a Fixed entry under Unreleased explaining what changed and why (production observation, anti-hallucination guard rails, required --base rationale, timeout bump).

…#32) * feat(multireview): ContainerExec helper for running commands inside the gateway container Wraps: ssh <host> -- docker exec <c> sh -c "$(echo <b64> | base64 -d)" The base64 round-trip removes all quoting concerns for the caller's command. Mirrors SSHRun semantics: non-zero exit is data, not error; returned error means the call could not be executed. 8 unit tests via the execCommandContext swap. No real SSH, no real docker. 100% coverage of the new function. Needed because the openclaw-gateway container has no /tmp bind mount, so prior runs (PR #29, PR #30) wrote bundles and diffs to host /tmp where reviewers running inside the container could never see them. Subsequent commits will route ShipBundle and PreComputeDiff through this helper. * feat(multireview): ship bundle into container via staging + docker cp The openclaw-gateway container has no /tmp bind mount, so the previous sequence (mkdir + scp + clone, all on host) created files reviewers could never see. New sequence: 1. ssh -- docker exec mkdir <container workdir> 2. ssh -- mkdir <host staging dir> 3. scp bundle.git -> host staging 4. ssh -- docker cp <host staging>/bundle.git <container>:<workdir>/ 5. ssh -- docker exec git clone (inside container workdir) ShipBundleParams gains: - GatewayContainer (defaults to openclaw-gateway) - HostStagingDir (required; scp landing pad) ShipBundleResult.RemoteRepoPath now returns a container-local path. RemoteBundlePath was renamed to HostStagingBundlePath since the bundle on disk lives on the host, not the container. 7 bundle tests verify the 5-call sequence, container defaults, field-required validations, and per-step failure modes (container-mkdir fail, docker-cp fail, container-clone fail). The RemoteRepoPath-is-container-path invariant is asserted directly. multi_review_test.go mock updated to the renamed field. All existing tests still pass. * feat(multireview): route PreComputeDiff through the gateway container git diff + wc now run inside the openclaw-gateway container via ContainerExec, so the resulting diff.txt lands in the container's overlay /tmp where reviewers can actually see it. Writing to the host's /tmp (PR #30's approach) put the file in a filesystem the container couldn't read. PreComputeDiffParams gains GatewayContainer (defaults to "openclaw-gateway"). Inner pipeline shape and wc parsing are unchanged. 2 new tests: - TestPreComputeDiff_RoutedThroughContainer: outer ssh contains 'docker exec', container name, and 'base64 -d'; decoded inner is still 'git ... diff > diff.txt && wc -c && wc -l' - TestPreComputeDiff_DefaultsContainerName: empty GatewayContainer defaults via ContainerExec Existing tests updated: HeadRefDefaultsToHEAD now decodes the base64 inner payload to find the quoted 'HEAD' token (the outer ssh args no longer contain it in clear text). Coverage 91.3% on PreComputeDiff. * feat(multi_review): thread gateway-container through, split cleanup in two runMultiReview now passes: - GatewayContainer into BOTH ShipBundleParams and PreComputeDiffParams - HostStagingDir into ShipBundleParams as /tmp/multi-review-staging-<ts> Cleanup is now two defers (LIFO): 1. ContainerExec rm -rf <container workdir> via containerExecFn 2. SSHRun rm -rf <host staging dir> via sshRunFn Both honor --skip-cleanup. The new sshRunFn and containerExecFn package vars let tests assert on cleanup commands deterministically. 4 new tests: - GatewayContainerFlagsThreaded: --gateway-container value reaches both ShipBundle and PreComputeDiff (different default name verifies the thread isn't just defaulting both ends) - HostStagingDirThreadsToShipBundle: HostStagingDir is set, has the /tmp/multi-review-staging- prefix, and differs from RemoteWorkdir - BothCleanupsFire: on success, container rm -rf <workdir> and host rm -rf <staging> both run - SkipCleanupSkipsBoth: --skip-cleanup suppresses both channels Coverage on runMultiReview: 94.3%. * docs(multi_review): document container-routed filesystem ops + CHANGELOG entry docs/llm-support-commands.md: replaces the pre-compute-step section with a 6-step container-routed pipeline (mkdir container + mkdir host staging + scp + docker cp + git clone in container + diff pre-compute via docker exec). Adds failure-semantics rows for the new container failure modes. CHANGELOG.md: new Unreleased fix entry under llm-support describing the host/container filesystem boundary bug and the ContainerExec / ShipBundle / PreComputeDiff refactor. Includes the breaking rename of ShipBundleResult.RemoteBundlePath -> HostStagingBundlePath. * fix(multi_review): register cleanup defers before ShipBundle Adversarial review surfaced a pre-existing (PR #29) bug: if ShipBundle fails partway (e.g. container mkdir succeeds, then docker cp fails), the container workdir leaks because cleanup defers were registered only AFTER ShipBundle returned success. Fix: register both cleanup defers BEFORE the ShipBundle call. rm -rf on a path that was never created is a harmless no-op, so this is safe. New test TestMultiReview_CleanupFiresOnShipBundleFailure verifies both cleanup channels fire even when ShipBundle returns an error. Also cleaned up dead code in TestShipBundle_ContainerCloneFails (the test had an unused first execCommandContext mock immediately overwritten by a second). And tightened the --help text to mention container routing and clarify the diff lives inside the container, not on the remote host.

samestrin added 3 commits May 13, 2026 12:33

samestrin merged commit 51968c6 into main May 13, 2026
8 checks passed

samestrin deleted the fix/multi-review-diff-precompute branch May 13, 2026 20:02

samestrin mentioned this pull request May 13, 2026

fix(multi_review): route filesystem ops through the gateway container #32

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(multi_review): pre-compute diff so reviewers can't hallucinate clone failures#30

fix(multi_review): pre-compute diff so reviewers can't hallucinate clone failures#30
samestrin merged 3 commits into
mainfrom
fix/multi-review-diff-precompute

samestrin commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

samestrin commented May 13, 2026

Summary

Fix

Test plan

Files changed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant