feat: multi_review subcommand + group_td REVIEWERS/CONFIDENCE columns by samestrin · Pull Request #29 · samestrin/llm-tools

samestrin · 2026-05-12T18:43:48Z

Summary

New llm-support multi_review subcommand orchestrates a code-review fan-out across multiple openclaw reviewer agents on a remote host. Bundles a local repo, ships it via scp, invokes reviewers in parallel (with an opt-in serial lane for shared-quota providers), and collects per-reviewer TD streams + a merged cross-reviewer stream with reviewer attribution.
group_td now supports two new feature-flagged trailing columns — REVIEWERS (comma-joined agent names from multi_review) and CONFIDENCE (HIGH/MEDIUM/LOW based on multi-reviewer agreement). Both follow the same opt-in pattern as the existing SOURCE column; legacy callers see no change.
Side fix: yaml.parseNumber was silently coercing version-like strings like "4.30" to 4.3, losing trailing-zero information. Now only coerces when the float round-trips through YAML marshal back to the exact input string.

Architecture

Designed to be invoked by the claude-prompts companion PR when review.mode: multi is set in .planning/.config/config.yaml:

${SPRINT_PATH}/code-review/multi-agent/
  raw/<agent>/{review.md, td-stream.txt, status.json, response.json}
  raw/td-stream-all.txt        # cross-reviewer merge
  td-stream.txt                # copy at root for reconcile auto-discovery
  multi-review-summary.json    # per-reviewer status, counts, partial flag

Failure semantics

Scenario	Behavior
Bundle/ship fails	Hard-stop with error naming the host
1+ reviewer fails, 1+ succeeds	Continue with successes, `partial: true`, exit 0
All reviewers fail	Exit 1 with summary path

Per-reviewer retry happens only at the openclaw-call layer; mid-stream failures bubble up cleanly.

Test plan

Unit tests for SSH wrapper (timeout, non-zero exit handling, stream separation): 81.8% line coverage
Unit tests for git bundle + ship-to-remote pipeline (success, partial failure, missing inputs): 82.4%
Unit tests for openclaw envelope parsing (happy path, malformed JSON, multiple payloads, SSH errors): 84%
Unit tests for TD stream extraction + merge (regex anchoring, multi-source merge with header): 83.2%
Unit tests for multi_review orchestrator (happy path, partial failure, all-fail, ship hard-stop, two-lane): 93.7% of multi_review.go
Backwards-compat tests for group_td SOURCE column behavior preserved
New tests for REVIEWERS + CONFIDENCE columns (with-cols, without-cols, partial, checkbox)
Full repo test suite passes (go test ./...)
End-to-end smoke test against real openclaw: Bruce reviewer on llm-env v1.6.0..HEAD — 4 findings produced in 67s wall, all expected output files present, summary structure correct

Notes

Test seam: execCommandContext in the multireview package is a package variable that tests swap. invokeReviewerFn and shipBundleFn are the equivalent indirection in the orchestrator command.
The Long help string for multi_review was wrong about the output layout in the first pass — corrected after the smoke test caught the inconsistency.
All 8 commits are atomic and pass tests independently; commit order = TDD order (SSH → bundle → openclaw → stream → orchestrator → group_td columns → docs).

…t coerce parseNumber was silently converting "4.30" → float 4.3 → "4.3" on emit. Now only coerces to float when the value round-trips through YAML marshal back to the exact input string. Adds regression tests for both the fix and the clean-numeric case to prevent over-correction.

First piece of the multi-reviewer pipeline. Wraps /usr/bin/ssh and /usr/bin/scp with context-driven timeouts, separate stdout/stderr capture (callers parse stdout JSON, stderr stays diagnostic), and explicit handling of non-zero remote exit codes as data rather than errors. Uses package-level execCommandContext indirection so tests inject scripted subprocess behavior via /bin/sh -c without depending on real SSH targets.

CreateBundle writes a full-history bundle (HEAD --branches --tags) so the remote clone can resolve any base tag without prerequisite gaps. Range-only bundles silently produce uncloneable artifacts; we learned this the hard way during the Bruce probe. ShipBundle orchestrates the full bundle → scp → remote clone sequence with shared timeout. Returns the remote clone path so reviewer invocations can read from a known location. Single-quote shell escaping for the workdir and repo name keeps directory names with spaces or unusual chars safe. End-to-end test against a real fixture repo verifies the bundle clones cleanly and resolves the original tag. Pipeline test validates the three SSH/SCP calls happen in the right order via execCommandContext mocking.

…lope InvokeReviewer wraps the `docker exec openclaw-gateway openclaw agent --json` call we worked out during the bench probes. The task message is base64 encoded for transport so any prompt content (quotes, newlines, code fences, dollar signs) crosses the SSH boundary intact. Parses the envelope shape we observed live: {runId, status, summary, result: {payloads: [{text, mediaUrl}], meta: {durationMs, aborted, agentMeta: {model, sessionId, provider}}}} Concatenates payload text into a single ReviewProse field (matches the behavior of our run-bench.sh extraction) and preserves the raw JSON for diagnosis. Distinguishes malformed JSON, empty response, SSH-level failure, and non-zero exit codes — each surfaces with a different error message.

…viewer merge ExtractTDLines anchors on a strict regex (severity token followed immediately by a pipe at line start) so prose mentions of "HIGH severity" or similar are not mistaken for findings. CRITICAL/HIGH/MEDIUM/LOW all supported. WriteReviewerOutput drops four artifacts per reviewer under <root>/<agent>/: review.md — prose for human reading td-stream.txt — pipe-delimited findings with |<agent> appended status.json — model, duration, status, td count response.json — raw openclaw envelope (preserved for replay) MergeStreams concatenates per-reviewer streams into td-stream-all.txt with a header comment. Missing reviewer dirs are silently skipped — a reviewer might have failed to invoke and we proceed with what completed.

The top-level orchestrator. Wires together bundle/ship/invoke/merge into a single command that /code-review can call when review.mode: multi is set. Two-lane execution: --reviewers all agents involved --serial-reviewers subset that runs sequentially after the parallel lane (for shared rate limits) Failure semantics: - Bundle/ship failure → hard-stop (no point continuing without diff staged on remote) - Per-reviewer failure → recorded in summary, other reviewers continue, exit 0 with partial: true - All reviewers fail → exit 1 with summary path Test seams use package-level invokeReviewerFn and shipBundleFn variables so tests inject mock behavior without spinning up SSH/openclaw. Cleanup of the remote workdir happens via deferred SSH rm -rf with a tight 30s timeout so test failures or panics don't block. Per-function coverage on multi_review.go: 67-100%, dominated by the runMultiReview happy path at 93.7%.

Multi-reviewer code review produces findings with reviewer attribution (which agents caught it) and confidence (how many agreed). These travel through the TD stream and need to land in the TD README.md table. Mirrors the existing SOURCE column feature-flag pattern: a column is emitted only when at least one input row carries a non-empty value for it. Callers that don't pass REVIEWERS/CONFIDENCE through --headers see the same output they always saw — 7-column legacy, 8-column with Source, etc. While here, refactor the markdown header + row writers to assemble cells dynamically instead of enumerating every (checkbox × source) combination. The old switch already had 4 cases; adding 2 more feature-flagged columns would have ballooned it to 16. The dynamic builder preserves exact cell spacing semantics (header empty-cell `| |` single-space, data empty-cell `| |` two-space) so downstream column-position parsers see no change. REVIEWERS is stored comma-joined in the stream ("bruce,greta") and rendered with a space after each comma in the cell ("bruce, greta") for readability.

CHANGELOG gets an Unreleased section covering multi_review, the group_td column additions, and the yaml parseNumber regression fix (separately committed but worth surfacing in the release notes). docs/llm-support-commands.md gets a multi_review reference entry under Technical Debt Management with flag tables, output layout, failure semantics, and three example invocations. The group-td section gets a small note about the new optional trailing columns.

…put root Renamed the top-level merged file from td-stream-all.txt to td-stream.txt so /reconcile-code-review's auto-discovery (any code-review/<source>/ dir with a td-stream.txt is a source) treats multi-agent as one unified source. The per-reviewer raw streams stay at raw/<agent>/td-stream.txt; the cross-reviewer merge stays at raw/td-stream-all.txt for inspection.

The --help text still listed td-stream-all.txt at the output root, but the actual layout is: raw/td-stream-all.txt (cross-reviewer merge, inside raw/) td-stream.txt (copy at root, where reconcile auto-discovers) End-to-end smoke test against real openclaw (Bruce on llm-env v1.6.0..HEAD): - bundle + scp to nucleus.lan: clean - openclaw agent invocation + JSON envelope parse: clean - 4 findings extracted, written to raw/bruce/, merged to root: matches the manual bench numbers - status.json + multi-review-summary.json: structure correct - exit 0, total 67s

…#32) * feat(multireview): ContainerExec helper for running commands inside the gateway container Wraps: ssh <host> -- docker exec <c> sh -c "$(echo <b64> | base64 -d)" The base64 round-trip removes all quoting concerns for the caller's command. Mirrors SSHRun semantics: non-zero exit is data, not error; returned error means the call could not be executed. 8 unit tests via the execCommandContext swap. No real SSH, no real docker. 100% coverage of the new function. Needed because the openclaw-gateway container has no /tmp bind mount, so prior runs (PR #29, PR #30) wrote bundles and diffs to host /tmp where reviewers running inside the container could never see them. Subsequent commits will route ShipBundle and PreComputeDiff through this helper. * feat(multireview): ship bundle into container via staging + docker cp The openclaw-gateway container has no /tmp bind mount, so the previous sequence (mkdir + scp + clone, all on host) created files reviewers could never see. New sequence: 1. ssh -- docker exec mkdir <container workdir> 2. ssh -- mkdir <host staging dir> 3. scp bundle.git -> host staging 4. ssh -- docker cp <host staging>/bundle.git <container>:<workdir>/ 5. ssh -- docker exec git clone (inside container workdir) ShipBundleParams gains: - GatewayContainer (defaults to openclaw-gateway) - HostStagingDir (required; scp landing pad) ShipBundleResult.RemoteRepoPath now returns a container-local path. RemoteBundlePath was renamed to HostStagingBundlePath since the bundle on disk lives on the host, not the container. 7 bundle tests verify the 5-call sequence, container defaults, field-required validations, and per-step failure modes (container-mkdir fail, docker-cp fail, container-clone fail). The RemoteRepoPath-is-container-path invariant is asserted directly. multi_review_test.go mock updated to the renamed field. All existing tests still pass. * feat(multireview): route PreComputeDiff through the gateway container git diff + wc now run inside the openclaw-gateway container via ContainerExec, so the resulting diff.txt lands in the container's overlay /tmp where reviewers can actually see it. Writing to the host's /tmp (PR #30's approach) put the file in a filesystem the container couldn't read. PreComputeDiffParams gains GatewayContainer (defaults to "openclaw-gateway"). Inner pipeline shape and wc parsing are unchanged. 2 new tests: - TestPreComputeDiff_RoutedThroughContainer: outer ssh contains 'docker exec', container name, and 'base64 -d'; decoded inner is still 'git ... diff > diff.txt && wc -c && wc -l' - TestPreComputeDiff_DefaultsContainerName: empty GatewayContainer defaults via ContainerExec Existing tests updated: HeadRefDefaultsToHEAD now decodes the base64 inner payload to find the quoted 'HEAD' token (the outer ssh args no longer contain it in clear text). Coverage 91.3% on PreComputeDiff. * feat(multi_review): thread gateway-container through, split cleanup in two runMultiReview now passes: - GatewayContainer into BOTH ShipBundleParams and PreComputeDiffParams - HostStagingDir into ShipBundleParams as /tmp/multi-review-staging-<ts> Cleanup is now two defers (LIFO): 1. ContainerExec rm -rf <container workdir> via containerExecFn 2. SSHRun rm -rf <host staging dir> via sshRunFn Both honor --skip-cleanup. The new sshRunFn and containerExecFn package vars let tests assert on cleanup commands deterministically. 4 new tests: - GatewayContainerFlagsThreaded: --gateway-container value reaches both ShipBundle and PreComputeDiff (different default name verifies the thread isn't just defaulting both ends) - HostStagingDirThreadsToShipBundle: HostStagingDir is set, has the /tmp/multi-review-staging- prefix, and differs from RemoteWorkdir - BothCleanupsFire: on success, container rm -rf <workdir> and host rm -rf <staging> both run - SkipCleanupSkipsBoth: --skip-cleanup suppresses both channels Coverage on runMultiReview: 94.3%. * docs(multi_review): document container-routed filesystem ops + CHANGELOG entry docs/llm-support-commands.md: replaces the pre-compute-step section with a 6-step container-routed pipeline (mkdir container + mkdir host staging + scp + docker cp + git clone in container + diff pre-compute via docker exec). Adds failure-semantics rows for the new container failure modes. CHANGELOG.md: new Unreleased fix entry under llm-support describing the host/container filesystem boundary bug and the ContainerExec / ShipBundle / PreComputeDiff refactor. Includes the breaking rename of ShipBundleResult.RemoteBundlePath -> HostStagingBundlePath. * fix(multi_review): register cleanup defers before ShipBundle Adversarial review surfaced a pre-existing (PR #29) bug: if ShipBundle fails partway (e.g. container mkdir succeeds, then docker cp fails), the container workdir leaks because cleanup defers were registered only AFTER ShipBundle returned success. Fix: register both cleanup defers BEFORE the ShipBundle call. rm -rf on a path that was never created is a harmless no-op, so this is safe. New test TestMultiReview_CleanupFiresOnShipBundleFailure verifies both cleanup channels fire even when ShipBundle returns an error. Also cleaned up dead code in TestShipBundle_ContainerCloneFails (the test had an unused first execCommandContext mock immediately overwritten by a second). And tightened the --help text to mention container routing and clarify the diff lives inside the container, not on the remote host.

samestrin added 10 commits May 11, 2026 18:04

samestrin merged commit 89c4036 into main May 12, 2026
8 checks passed

samestrin deleted the feat/multi-reviewer-code-review branch May 12, 2026 19:39

This was referenced May 13, 2026

fix(multi_review): pre-compute diff so reviewers can't hallucinate clone failures #30

Merged

fix(multi_review): route filesystem ops through the gateway container #32

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: multi_review subcommand + group_td REVIEWERS/CONFIDENCE columns#29

feat: multi_review subcommand + group_td REVIEWERS/CONFIDENCE columns#29
samestrin merged 10 commits into
mainfrom
feat/multi-reviewer-code-review

samestrin commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

samestrin commented May 12, 2026

Summary

Architecture

Failure semantics

Test plan

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant