feat: multi_review subcommand + group_td REVIEWERS/CONFIDENCE columns#29
Merged
Conversation
…t coerce parseNumber was silently converting "4.30" → float 4.3 → "4.3" on emit. Now only coerces to float when the value round-trips through YAML marshal back to the exact input string. Adds regression tests for both the fix and the clean-numeric case to prevent over-correction.
First piece of the multi-reviewer pipeline. Wraps /usr/bin/ssh and /usr/bin/scp with context-driven timeouts, separate stdout/stderr capture (callers parse stdout JSON, stderr stays diagnostic), and explicit handling of non-zero remote exit codes as data rather than errors. Uses package-level execCommandContext indirection so tests inject scripted subprocess behavior via /bin/sh -c without depending on real SSH targets.
CreateBundle writes a full-history bundle (HEAD --branches --tags) so the remote clone can resolve any base tag without prerequisite gaps. Range-only bundles silently produce uncloneable artifacts; we learned this the hard way during the Bruce probe. ShipBundle orchestrates the full bundle → scp → remote clone sequence with shared timeout. Returns the remote clone path so reviewer invocations can read from a known location. Single-quote shell escaping for the workdir and repo name keeps directory names with spaces or unusual chars safe. End-to-end test against a real fixture repo verifies the bundle clones cleanly and resolves the original tag. Pipeline test validates the three SSH/SCP calls happen in the right order via execCommandContext mocking.
…lope
InvokeReviewer wraps the `docker exec openclaw-gateway openclaw agent --json`
call we worked out during the bench probes. The task message is base64
encoded for transport so any prompt content (quotes, newlines, code
fences, dollar signs) crosses the SSH boundary intact.
Parses the envelope shape we observed live:
{runId, status, summary, result: {payloads: [{text, mediaUrl}], meta: {durationMs, aborted, agentMeta: {model, sessionId, provider}}}}
Concatenates payload text into a single ReviewProse field (matches the
behavior of our run-bench.sh extraction) and preserves the raw JSON for
diagnosis. Distinguishes malformed JSON, empty response, SSH-level failure,
and non-zero exit codes — each surfaces with a different error message.
…viewer merge ExtractTDLines anchors on a strict regex (severity token followed immediately by a pipe at line start) so prose mentions of "HIGH severity" or similar are not mistaken for findings. CRITICAL/HIGH/MEDIUM/LOW all supported. WriteReviewerOutput drops four artifacts per reviewer under <root>/<agent>/: review.md — prose for human reading td-stream.txt — pipe-delimited findings with |<agent> appended status.json — model, duration, status, td count response.json — raw openclaw envelope (preserved for replay) MergeStreams concatenates per-reviewer streams into td-stream-all.txt with a header comment. Missing reviewer dirs are silently skipped — a reviewer might have failed to invoke and we proceed with what completed.
The top-level orchestrator. Wires together bundle/ship/invoke/merge into a
single command that /code-review can call when review.mode: multi is set.
Two-lane execution:
--reviewers all agents involved
--serial-reviewers subset that runs sequentially after the parallel lane
(for shared rate limits)
Failure semantics:
- Bundle/ship failure → hard-stop (no point continuing without diff
staged on remote)
- Per-reviewer failure → recorded in summary, other reviewers continue,
exit 0 with partial: true
- All reviewers fail → exit 1 with summary path
Test seams use package-level invokeReviewerFn and shipBundleFn variables
so tests inject mock behavior without spinning up SSH/openclaw. Cleanup
of the remote workdir happens via deferred SSH rm -rf with a tight
30s timeout so test failures or panics don't block.
Per-function coverage on multi_review.go: 67-100%, dominated by the
runMultiReview happy path at 93.7%.
Multi-reviewer code review produces findings with reviewer attribution
(which agents caught it) and confidence (how many agreed). These travel
through the TD stream and need to land in the TD README.md table.
Mirrors the existing SOURCE column feature-flag pattern: a column is
emitted only when at least one input row carries a non-empty value for it.
Callers that don't pass REVIEWERS/CONFIDENCE through --headers see the
same output they always saw — 7-column legacy, 8-column with Source, etc.
While here, refactor the markdown header + row writers to assemble cells
dynamically instead of enumerating every (checkbox × source) combination.
The old switch already had 4 cases; adding 2 more feature-flagged columns
would have ballooned it to 16. The dynamic builder preserves exact cell
spacing semantics (header empty-cell `| |` single-space, data empty-cell
`| |` two-space) so downstream column-position parsers see no change.
REVIEWERS is stored comma-joined in the stream ("bruce,greta") and
rendered with a space after each comma in the cell ("bruce, greta") for
readability.
CHANGELOG gets an Unreleased section covering multi_review, the group_td column additions, and the yaml parseNumber regression fix (separately committed but worth surfacing in the release notes). docs/llm-support-commands.md gets a multi_review reference entry under Technical Debt Management with flag tables, output layout, failure semantics, and three example invocations. The group-td section gets a small note about the new optional trailing columns.
…put root Renamed the top-level merged file from td-stream-all.txt to td-stream.txt so /reconcile-code-review's auto-discovery (any code-review/<source>/ dir with a td-stream.txt is a source) treats multi-agent as one unified source. The per-reviewer raw streams stay at raw/<agent>/td-stream.txt; the cross-reviewer merge stays at raw/td-stream-all.txt for inspection.
The --help text still listed td-stream-all.txt at the output root, but
the actual layout is:
raw/td-stream-all.txt (cross-reviewer merge, inside raw/)
td-stream.txt (copy at root, where reconcile auto-discovers)
End-to-end smoke test against real openclaw (Bruce on llm-env v1.6.0..HEAD):
- bundle + scp to nucleus.lan: clean
- openclaw agent invocation + JSON envelope parse: clean
- 4 findings extracted, written to raw/bruce/, merged to root: matches
the manual bench numbers
- status.json + multi-review-summary.json: structure correct
- exit 0, total 67s
This was referenced May 13, 2026
samestrin
added a commit
that referenced
this pull request
May 13, 2026
…#32) * feat(multireview): ContainerExec helper for running commands inside the gateway container Wraps: ssh <host> -- docker exec <c> sh -c "$(echo <b64> | base64 -d)" The base64 round-trip removes all quoting concerns for the caller's command. Mirrors SSHRun semantics: non-zero exit is data, not error; returned error means the call could not be executed. 8 unit tests via the execCommandContext swap. No real SSH, no real docker. 100% coverage of the new function. Needed because the openclaw-gateway container has no /tmp bind mount, so prior runs (PR #29, PR #30) wrote bundles and diffs to host /tmp where reviewers running inside the container could never see them. Subsequent commits will route ShipBundle and PreComputeDiff through this helper. * feat(multireview): ship bundle into container via staging + docker cp The openclaw-gateway container has no /tmp bind mount, so the previous sequence (mkdir + scp + clone, all on host) created files reviewers could never see. New sequence: 1. ssh -- docker exec mkdir <container workdir> 2. ssh -- mkdir <host staging dir> 3. scp bundle.git -> host staging 4. ssh -- docker cp <host staging>/bundle.git <container>:<workdir>/ 5. ssh -- docker exec git clone (inside container workdir) ShipBundleParams gains: - GatewayContainer (defaults to openclaw-gateway) - HostStagingDir (required; scp landing pad) ShipBundleResult.RemoteRepoPath now returns a container-local path. RemoteBundlePath was renamed to HostStagingBundlePath since the bundle on disk lives on the host, not the container. 7 bundle tests verify the 5-call sequence, container defaults, field-required validations, and per-step failure modes (container-mkdir fail, docker-cp fail, container-clone fail). The RemoteRepoPath-is-container-path invariant is asserted directly. multi_review_test.go mock updated to the renamed field. All existing tests still pass. * feat(multireview): route PreComputeDiff through the gateway container git diff + wc now run inside the openclaw-gateway container via ContainerExec, so the resulting diff.txt lands in the container's overlay /tmp where reviewers can actually see it. Writing to the host's /tmp (PR #30's approach) put the file in a filesystem the container couldn't read. PreComputeDiffParams gains GatewayContainer (defaults to "openclaw-gateway"). Inner pipeline shape and wc parsing are unchanged. 2 new tests: - TestPreComputeDiff_RoutedThroughContainer: outer ssh contains 'docker exec', container name, and 'base64 -d'; decoded inner is still 'git ... diff > diff.txt && wc -c && wc -l' - TestPreComputeDiff_DefaultsContainerName: empty GatewayContainer defaults via ContainerExec Existing tests updated: HeadRefDefaultsToHEAD now decodes the base64 inner payload to find the quoted 'HEAD' token (the outer ssh args no longer contain it in clear text). Coverage 91.3% on PreComputeDiff. * feat(multi_review): thread gateway-container through, split cleanup in two runMultiReview now passes: - GatewayContainer into BOTH ShipBundleParams and PreComputeDiffParams - HostStagingDir into ShipBundleParams as /tmp/multi-review-staging-<ts> Cleanup is now two defers (LIFO): 1. ContainerExec rm -rf <container workdir> via containerExecFn 2. SSHRun rm -rf <host staging dir> via sshRunFn Both honor --skip-cleanup. The new sshRunFn and containerExecFn package vars let tests assert on cleanup commands deterministically. 4 new tests: - GatewayContainerFlagsThreaded: --gateway-container value reaches both ShipBundle and PreComputeDiff (different default name verifies the thread isn't just defaulting both ends) - HostStagingDirThreadsToShipBundle: HostStagingDir is set, has the /tmp/multi-review-staging- prefix, and differs from RemoteWorkdir - BothCleanupsFire: on success, container rm -rf <workdir> and host rm -rf <staging> both run - SkipCleanupSkipsBoth: --skip-cleanup suppresses both channels Coverage on runMultiReview: 94.3%. * docs(multi_review): document container-routed filesystem ops + CHANGELOG entry docs/llm-support-commands.md: replaces the pre-compute-step section with a 6-step container-routed pipeline (mkdir container + mkdir host staging + scp + docker cp + git clone in container + diff pre-compute via docker exec). Adds failure-semantics rows for the new container failure modes. CHANGELOG.md: new Unreleased fix entry under llm-support describing the host/container filesystem boundary bug and the ContainerExec / ShipBundle / PreComputeDiff refactor. Includes the breaking rename of ShipBundleResult.RemoteBundlePath -> HostStagingBundlePath. * fix(multi_review): register cleanup defers before ShipBundle Adversarial review surfaced a pre-existing (PR #29) bug: if ShipBundle fails partway (e.g. container mkdir succeeds, then docker cp fails), the container workdir leaks because cleanup defers were registered only AFTER ShipBundle returned success. Fix: register both cleanup defers BEFORE the ShipBundle call. rm -rf on a path that was never created is a harmless no-op, so this is safe. New test TestMultiReview_CleanupFiresOnShipBundleFailure verifies both cleanup channels fire even when ShipBundle returns an error. Also cleaned up dead code in TestShipBundle_ContainerCloneFails (the test had an unused first execCommandContext mock immediately overwritten by a second). And tightened the --help text to mention container routing and clarify the diff lives inside the container, not on the remote host.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
llm-support multi_reviewsubcommand orchestrates a code-review fan-out across multiple openclaw reviewer agents on a remote host. Bundles a local repo, ships it via scp, invokes reviewers in parallel (with an opt-in serial lane for shared-quota providers), and collects per-reviewer TD streams + a merged cross-reviewer stream with reviewer attribution.group_tdnow supports two new feature-flagged trailing columns —REVIEWERS(comma-joined agent names frommulti_review) andCONFIDENCE(HIGH/MEDIUM/LOW based on multi-reviewer agreement). Both follow the same opt-in pattern as the existingSOURCEcolumn; legacy callers see no change.yaml.parseNumberwas silently coercing version-like strings like"4.30"to4.3, losing trailing-zero information. Now only coerces when the float round-trips through YAML marshal back to the exact input string.Architecture
Designed to be invoked by the claude-prompts companion PR when
review.mode: multiis set in.planning/.config/config.yaml:Failure semantics
partial: true, exit 0Per-reviewer retry happens only at the openclaw-call layer; mid-stream failures bubble up cleanly.
Test plan
group_tdSOURCE column behavior preservedgo test ./...)llm-env v1.6.0..HEAD— 4 findings produced in 67s wall, all expected output files present, summary structure correctNotes
execCommandContextin themultireviewpackage is a package variable that tests swap.invokeReviewerFnandshipBundleFnare the equivalent indirection in the orchestrator command.Longhelp string formulti_reviewwas wrong about the output layout in the first pass — corrected after the smoke test caught the inconsistency.