fix(multi_review): pre-compute diff so reviewers can't hallucinate clone failures#30
Merged
Merged
Conversation
Reviewer agents in the multi_review fan-out were hallucinating "clone
missing" / "ref not found" failures rather than running `git diff
<base>..<head>` themselves. Pre-computing the diff once on the remote
and pointing reviewers at the resulting file is the durable fix.
Adds PreComputeDiff(host, repo, workdir, base, head) which SSHs to the
remote, runs `git -C <repo> diff <base>..<head> > <work>/diff.txt`
followed by `wc -c` and `wc -l` for size/line stats, and returns a
PreComputeDiffResult with the diff path and counts.
Failure semantics:
- Missing inputs (especially BaseRef — working-tree mode is intentionally
unsupported) → validation error, no SSH call
- Non-zero git exit (bad ref) → wraps stderr in error so the orchestrator
can hard-stop with a clear message
- Context deadline → propagated
- Zero-byte diff → result.Empty=true, no error; caller decides
Parser tolerates both BSD wc output (" 1234 path") and GNU wc output
("1234 path") via first-numeric-token extraction. Both formats explicitly
covered by unit tests.
Coverage: 83% of statements on the new file (full file path coverage on
the happy path + 7 edge cases). No real SSH or openclaw calls — all
subprocess invocations mocked via the existing execCommandContext seam.
… to 1200s
Three changes to runMultiReview, one new prompt body, one default change,
plus the tests that drive them all.
1. preComputeDiffFn indirection added alongside shipBundleFn /
invokeReviewerFn. Production points at multireview.PreComputeDiff;
tests swap via withMockPreComputeDiff helper.
2. runMultiReview now calls preComputeDiffFn after shipBundleFn returns.
Failure hard-stops the run before any reviewer is invoked. Verified
by TestMultiReview_DiffFailureHardStops which sets an invokeReviewerFn
that fails loud if called.
3. --base is now required. Working-tree mode is intentionally unsupported
— it was the root cause of the failure in production where reviewers
hallucinated "clone missing" rather than navigating the working tree.
Verified by TestMultiReview_EmptyBaseRefRejected which asserts
shipBundleFn is never reached.
4. buildDefaultTaskMessage rewritten to accept PreComputeDiffResult and
produce a prompt that:
- Points reviewers at the pre-computed diff path
- Tells them to `cat <path>` (one trivial exec call, hard to hallucinate around)
- Includes an anti-hallucination clause: "do NOT report 'clone failed'
unless `ls <path>` proves it"
- Appends a `git diff --stat` hint when SizeBytes > 1_000_000
5. --per-reviewer-timeout-seconds default bumped 600 → 1200. Production
observation: Greta and Dax timed out at 600s on a real sprint. 1200s
matches the existing --timeout-seconds (total) default.
All four existing TestMultiReview_* tests updated to pass --base v1
and call withMockPreComputeDiff. New tests:
- TestMultiReview_DiffFailureHardStops
- TestMultiReview_EmptyBaseRefRejected
- TestMultiReview_DiffPathInTaskMessage
- TestMultiReview_LargeDiffWarning
- TestMultiReview_SmallDiffNoWarning
No real openclaw / SSH calls in any test path — all subprocess execution
goes through the existing execCommandContext mock seam.
…1200s timeout
CLI long-help string, command reference, and CHANGELOG entry all updated
to reflect the diff-precompute fix:
- multi_review --help: explains the pre-compute step and its rationale
("weaker reviewers hallucinate failures rather than persist"), lists
--base as REQUIRED, and adds the new failure mode (diff-precompute
hard-stop) to the failure-semantics block
- --base flag help: marks it REQUIRED and explains it drives the
pre-compute, not just informational task-message decoration
- llm-support-commands.md: moves --base from optional to required flag
table, adds a dedicated "Pre-compute step" section explaining the
SSH command that runs, the file location, and the >1MB --stat hint
behavior. Bumps the --per-reviewer-timeout-seconds default from 600
to 1200 in the docs to match the code change.
- CHANGELOG: adds a Fixed entry under Unreleased explaining what
changed and why (production observation, anti-hallucination
guard rails, required --base rationale, timeout bump).
5 tasks
samestrin
added a commit
that referenced
this pull request
May 13, 2026
…#32) * feat(multireview): ContainerExec helper for running commands inside the gateway container Wraps: ssh <host> -- docker exec <c> sh -c "$(echo <b64> | base64 -d)" The base64 round-trip removes all quoting concerns for the caller's command. Mirrors SSHRun semantics: non-zero exit is data, not error; returned error means the call could not be executed. 8 unit tests via the execCommandContext swap. No real SSH, no real docker. 100% coverage of the new function. Needed because the openclaw-gateway container has no /tmp bind mount, so prior runs (PR #29, PR #30) wrote bundles and diffs to host /tmp where reviewers running inside the container could never see them. Subsequent commits will route ShipBundle and PreComputeDiff through this helper. * feat(multireview): ship bundle into container via staging + docker cp The openclaw-gateway container has no /tmp bind mount, so the previous sequence (mkdir + scp + clone, all on host) created files reviewers could never see. New sequence: 1. ssh -- docker exec mkdir <container workdir> 2. ssh -- mkdir <host staging dir> 3. scp bundle.git -> host staging 4. ssh -- docker cp <host staging>/bundle.git <container>:<workdir>/ 5. ssh -- docker exec git clone (inside container workdir) ShipBundleParams gains: - GatewayContainer (defaults to openclaw-gateway) - HostStagingDir (required; scp landing pad) ShipBundleResult.RemoteRepoPath now returns a container-local path. RemoteBundlePath was renamed to HostStagingBundlePath since the bundle on disk lives on the host, not the container. 7 bundle tests verify the 5-call sequence, container defaults, field-required validations, and per-step failure modes (container-mkdir fail, docker-cp fail, container-clone fail). The RemoteRepoPath-is-container-path invariant is asserted directly. multi_review_test.go mock updated to the renamed field. All existing tests still pass. * feat(multireview): route PreComputeDiff through the gateway container git diff + wc now run inside the openclaw-gateway container via ContainerExec, so the resulting diff.txt lands in the container's overlay /tmp where reviewers can actually see it. Writing to the host's /tmp (PR #30's approach) put the file in a filesystem the container couldn't read. PreComputeDiffParams gains GatewayContainer (defaults to "openclaw-gateway"). Inner pipeline shape and wc parsing are unchanged. 2 new tests: - TestPreComputeDiff_RoutedThroughContainer: outer ssh contains 'docker exec', container name, and 'base64 -d'; decoded inner is still 'git ... diff > diff.txt && wc -c && wc -l' - TestPreComputeDiff_DefaultsContainerName: empty GatewayContainer defaults via ContainerExec Existing tests updated: HeadRefDefaultsToHEAD now decodes the base64 inner payload to find the quoted 'HEAD' token (the outer ssh args no longer contain it in clear text). Coverage 91.3% on PreComputeDiff. * feat(multi_review): thread gateway-container through, split cleanup in two runMultiReview now passes: - GatewayContainer into BOTH ShipBundleParams and PreComputeDiffParams - HostStagingDir into ShipBundleParams as /tmp/multi-review-staging-<ts> Cleanup is now two defers (LIFO): 1. ContainerExec rm -rf <container workdir> via containerExecFn 2. SSHRun rm -rf <host staging dir> via sshRunFn Both honor --skip-cleanup. The new sshRunFn and containerExecFn package vars let tests assert on cleanup commands deterministically. 4 new tests: - GatewayContainerFlagsThreaded: --gateway-container value reaches both ShipBundle and PreComputeDiff (different default name verifies the thread isn't just defaulting both ends) - HostStagingDirThreadsToShipBundle: HostStagingDir is set, has the /tmp/multi-review-staging- prefix, and differs from RemoteWorkdir - BothCleanupsFire: on success, container rm -rf <workdir> and host rm -rf <staging> both run - SkipCleanupSkipsBoth: --skip-cleanup suppresses both channels Coverage on runMultiReview: 94.3%. * docs(multi_review): document container-routed filesystem ops + CHANGELOG entry docs/llm-support-commands.md: replaces the pre-compute-step section with a 6-step container-routed pipeline (mkdir container + mkdir host staging + scp + docker cp + git clone in container + diff pre-compute via docker exec). Adds failure-semantics rows for the new container failure modes. CHANGELOG.md: new Unreleased fix entry under llm-support describing the host/container filesystem boundary bug and the ContainerExec / ShipBundle / PreComputeDiff refactor. Includes the breaking rename of ShipBundleResult.RemoteBundlePath -> HostStagingBundlePath. * fix(multi_review): register cleanup defers before ShipBundle Adversarial review surfaced a pre-existing (PR #29) bug: if ShipBundle fails partway (e.g. container mkdir succeeds, then docker cp fails), the container workdir leaks because cleanup defers were registered only AFTER ShipBundle returned success. Fix: register both cleanup defers BEFORE the ShipBundle call. rm -rf on a path that was never created is a harmless no-op, so this is safe. New test TestMultiReview_CleanupFiresOnShipBundleFailure verifies both cleanup channels fire even when ShipBundle returns an error. Also cleaned up dead code in TestShipBundle_ContainerCloneFails (the test had an unused first execCommandContext mock immediately overwritten by a second). And tightened the --help text to mention container routing and clarify the diff lives inside the container, not on the remote host.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Real-world use against a 5500-line sprint surfaced two failures in the multi-reviewer flow shipped in #29:
status: error, "Repository clone missing"without ever runningls. Direct openclaw probe confirmed the clone was present and readable — the reviewer fabricated a failure rather than persist through the multi-stepcd <path> && git diff <base>..<head>instruction.Fix
Move diff visibility from "reviewer runs git diff" to "orchestrator pre-computes once, reviewer runs
cat":multireview.PreComputeDiffhelper SSHes to the remote and runsgit -C <repo> diff <base>..<head> > <work>/diff.txt && wc -c && wc -l. Returns path, size, line count.runMultiReviewcallspreComputeDiffFnaftershipBundleFn; any failure hard-stops before any reviewer is invoked.--baseis now required (working-tree fallback removed — observed to fail).ls <diffPath>actually shows the file is absent"). Large diffs (>1MB) get an appended--stathint.--per-reviewer-timeout-secondsdefault bumped 600 → 1200.Test plan
go test ./internal/support/multireview/...— 9 cases for PreComputeDiff (success, required inputs, bad ref, empty diff, large diff, BSD/GNU wc, timeout, HEAD default). 83% coverage on the new file.go test ./internal/support/commands/...— 5 new cases (diff-failure hard-stop, empty-base rejection, diff path in task message, large-diff warning, small-diff no-warning) plus 4 existing tests updated to pass--base.execCommandContextswap — no real SSH, no real openclaw, no campaigns.yamtorah.estr.in/sprint/1.1.2-I_tier_c_classifier_microservicewith--reviewers bruce,greta,kai,mira,dax,otto --base 10054c8 --head 9326759. Pass criteria: 5+/6 reviewers returnstatus: okwithtdLineCount > 0; total findings > 30; no reviewer reply contains "clone missing" / "directory doesn't exist" / "refs not found".Files changed
New
internal/support/multireview/diff.gointernal/support/multireview/diff_test.goModified
internal/support/commands/multi_review.go+ testdocs/llm-support-commands.mdCHANGELOG.md