Skip to content

feat: multi_review subcommand + group_td REVIEWERS/CONFIDENCE columns#29

Merged
samestrin merged 10 commits into
mainfrom
feat/multi-reviewer-code-review
May 12, 2026
Merged

feat: multi_review subcommand + group_td REVIEWERS/CONFIDENCE columns#29
samestrin merged 10 commits into
mainfrom
feat/multi-reviewer-code-review

Conversation

@samestrin
Copy link
Copy Markdown
Owner

Summary

  • New llm-support multi_review subcommand orchestrates a code-review fan-out across multiple openclaw reviewer agents on a remote host. Bundles a local repo, ships it via scp, invokes reviewers in parallel (with an opt-in serial lane for shared-quota providers), and collects per-reviewer TD streams + a merged cross-reviewer stream with reviewer attribution.
  • group_td now supports two new feature-flagged trailing columns — REVIEWERS (comma-joined agent names from multi_review) and CONFIDENCE (HIGH/MEDIUM/LOW based on multi-reviewer agreement). Both follow the same opt-in pattern as the existing SOURCE column; legacy callers see no change.
  • Side fix: yaml.parseNumber was silently coercing version-like strings like "4.30" to 4.3, losing trailing-zero information. Now only coerces when the float round-trips through YAML marshal back to the exact input string.

Architecture

Designed to be invoked by the claude-prompts companion PR when review.mode: multi is set in .planning/.config/config.yaml:

${SPRINT_PATH}/code-review/multi-agent/
  raw/<agent>/{review.md, td-stream.txt, status.json, response.json}
  raw/td-stream-all.txt        # cross-reviewer merge
  td-stream.txt                # copy at root for reconcile auto-discovery
  multi-review-summary.json    # per-reviewer status, counts, partial flag

Failure semantics

Scenario Behavior
Bundle/ship fails Hard-stop with error naming the host
1+ reviewer fails, 1+ succeeds Continue with successes, partial: true, exit 0
All reviewers fail Exit 1 with summary path

Per-reviewer retry happens only at the openclaw-call layer; mid-stream failures bubble up cleanly.

Test plan

  • Unit tests for SSH wrapper (timeout, non-zero exit handling, stream separation): 81.8% line coverage
  • Unit tests for git bundle + ship-to-remote pipeline (success, partial failure, missing inputs): 82.4%
  • Unit tests for openclaw envelope parsing (happy path, malformed JSON, multiple payloads, SSH errors): 84%
  • Unit tests for TD stream extraction + merge (regex anchoring, multi-source merge with header): 83.2%
  • Unit tests for multi_review orchestrator (happy path, partial failure, all-fail, ship hard-stop, two-lane): 93.7% of multi_review.go
  • Backwards-compat tests for group_td SOURCE column behavior preserved
  • New tests for REVIEWERS + CONFIDENCE columns (with-cols, without-cols, partial, checkbox)
  • Full repo test suite passes (go test ./...)
  • End-to-end smoke test against real openclaw: Bruce reviewer on llm-env v1.6.0..HEAD — 4 findings produced in 67s wall, all expected output files present, summary structure correct

Notes

  • Test seam: execCommandContext in the multireview package is a package variable that tests swap. invokeReviewerFn and shipBundleFn are the equivalent indirection in the orchestrator command.
  • The Long help string for multi_review was wrong about the output layout in the first pass — corrected after the smoke test caught the inconsistency.
  • All 8 commits are atomic and pass tests independently; commit order = TDD order (SSH → bundle → openclaw → stream → orchestrator → group_td columns → docs).

samestrin added 10 commits May 11, 2026 18:04
…t coerce

parseNumber was silently converting "4.30" → float 4.3 → "4.3" on emit.
Now only coerces to float when the value round-trips through YAML marshal
back to the exact input string. Adds regression tests for both the fix and
the clean-numeric case to prevent over-correction.
First piece of the multi-reviewer pipeline. Wraps /usr/bin/ssh and
/usr/bin/scp with context-driven timeouts, separate stdout/stderr capture
(callers parse stdout JSON, stderr stays diagnostic), and explicit handling
of non-zero remote exit codes as data rather than errors.

Uses package-level execCommandContext indirection so tests inject scripted
subprocess behavior via /bin/sh -c without depending on real SSH targets.
CreateBundle writes a full-history bundle (HEAD --branches --tags) so the
remote clone can resolve any base tag without prerequisite gaps. Range-only
bundles silently produce uncloneable artifacts; we learned this the hard
way during the Bruce probe.

ShipBundle orchestrates the full bundle → scp → remote clone sequence with
shared timeout. Returns the remote clone path so reviewer invocations can
read from a known location. Single-quote shell escaping for the workdir
and repo name keeps directory names with spaces or unusual chars safe.

End-to-end test against a real fixture repo verifies the bundle clones
cleanly and resolves the original tag. Pipeline test validates the three
SSH/SCP calls happen in the right order via execCommandContext mocking.
…lope

InvokeReviewer wraps the `docker exec openclaw-gateway openclaw agent --json`
call we worked out during the bench probes. The task message is base64
encoded for transport so any prompt content (quotes, newlines, code
fences, dollar signs) crosses the SSH boundary intact.

Parses the envelope shape we observed live:
  {runId, status, summary, result: {payloads: [{text, mediaUrl}], meta: {durationMs, aborted, agentMeta: {model, sessionId, provider}}}}

Concatenates payload text into a single ReviewProse field (matches the
behavior of our run-bench.sh extraction) and preserves the raw JSON for
diagnosis. Distinguishes malformed JSON, empty response, SSH-level failure,
and non-zero exit codes — each surfaces with a different error message.
…viewer merge

ExtractTDLines anchors on a strict regex (severity token followed
immediately by a pipe at line start) so prose mentions of "HIGH severity"
or similar are not mistaken for findings. CRITICAL/HIGH/MEDIUM/LOW all
supported.

WriteReviewerOutput drops four artifacts per reviewer under <root>/<agent>/:
  review.md     — prose for human reading
  td-stream.txt — pipe-delimited findings with |<agent> appended
  status.json   — model, duration, status, td count
  response.json — raw openclaw envelope (preserved for replay)

MergeStreams concatenates per-reviewer streams into td-stream-all.txt with
a header comment. Missing reviewer dirs are silently skipped — a reviewer
might have failed to invoke and we proceed with what completed.
The top-level orchestrator. Wires together bundle/ship/invoke/merge into a
single command that /code-review can call when review.mode: multi is set.

Two-lane execution:
  --reviewers          all agents involved
  --serial-reviewers   subset that runs sequentially after the parallel lane
                       (for shared rate limits)

Failure semantics:
  - Bundle/ship failure → hard-stop (no point continuing without diff
    staged on remote)
  - Per-reviewer failure → recorded in summary, other reviewers continue,
    exit 0 with partial: true
  - All reviewers fail → exit 1 with summary path

Test seams use package-level invokeReviewerFn and shipBundleFn variables
so tests inject mock behavior without spinning up SSH/openclaw. Cleanup
of the remote workdir happens via deferred SSH rm -rf with a tight
30s timeout so test failures or panics don't block.

Per-function coverage on multi_review.go: 67-100%, dominated by the
runMultiReview happy path at 93.7%.
Multi-reviewer code review produces findings with reviewer attribution
(which agents caught it) and confidence (how many agreed). These travel
through the TD stream and need to land in the TD README.md table.

Mirrors the existing SOURCE column feature-flag pattern: a column is
emitted only when at least one input row carries a non-empty value for it.
Callers that don't pass REVIEWERS/CONFIDENCE through --headers see the
same output they always saw — 7-column legacy, 8-column with Source, etc.

While here, refactor the markdown header + row writers to assemble cells
dynamically instead of enumerating every (checkbox × source) combination.
The old switch already had 4 cases; adding 2 more feature-flagged columns
would have ballooned it to 16. The dynamic builder preserves exact cell
spacing semantics (header empty-cell `| |` single-space, data empty-cell
`|  |` two-space) so downstream column-position parsers see no change.

REVIEWERS is stored comma-joined in the stream ("bruce,greta") and
rendered with a space after each comma in the cell ("bruce, greta") for
readability.
CHANGELOG gets an Unreleased section covering multi_review, the
group_td column additions, and the yaml parseNumber regression fix
(separately committed but worth surfacing in the release notes).

docs/llm-support-commands.md gets a multi_review reference entry under
Technical Debt Management with flag tables, output layout, failure
semantics, and three example invocations. The group-td section gets a
small note about the new optional trailing columns.
…put root

Renamed the top-level merged file from td-stream-all.txt to td-stream.txt
so /reconcile-code-review's auto-discovery (any code-review/<source>/ dir
with a td-stream.txt is a source) treats multi-agent as one unified
source. The per-reviewer raw streams stay at raw/<agent>/td-stream.txt;
the cross-reviewer merge stays at raw/td-stream-all.txt for inspection.
The --help text still listed td-stream-all.txt at the output root, but
the actual layout is:
  raw/td-stream-all.txt   (cross-reviewer merge, inside raw/)
  td-stream.txt           (copy at root, where reconcile auto-discovers)

End-to-end smoke test against real openclaw (Bruce on llm-env v1.6.0..HEAD):
  - bundle + scp to nucleus.lan: clean
  - openclaw agent invocation + JSON envelope parse: clean
  - 4 findings extracted, written to raw/bruce/, merged to root: matches
    the manual bench numbers
  - status.json + multi-review-summary.json: structure correct
  - exit 0, total 67s
@samestrin samestrin merged commit 89c4036 into main May 12, 2026
8 checks passed
@samestrin samestrin deleted the feat/multi-reviewer-code-review branch May 12, 2026 19:39
samestrin added a commit that referenced this pull request May 13, 2026
…#32)

* feat(multireview): ContainerExec helper for running commands inside the gateway container

Wraps:  ssh <host> -- docker exec <c> sh -c "$(echo <b64> | base64 -d)"

The base64 round-trip removes all quoting concerns for the caller's
command. Mirrors SSHRun semantics: non-zero exit is data, not error;
returned error means the call could not be executed.

8 unit tests via the execCommandContext swap. No real SSH, no real
docker. 100% coverage of the new function.

Needed because the openclaw-gateway container has no /tmp bind mount,
so prior runs (PR #29, PR #30) wrote bundles and diffs to host /tmp
where reviewers running inside the container could never see them.
Subsequent commits will route ShipBundle and PreComputeDiff through
this helper.

* feat(multireview): ship bundle into container via staging + docker cp

The openclaw-gateway container has no /tmp bind mount, so the previous
sequence (mkdir + scp + clone, all on host) created files reviewers
could never see. New sequence:

  1. ssh -- docker exec mkdir <container workdir>
  2. ssh -- mkdir <host staging dir>
  3. scp bundle.git -> host staging
  4. ssh -- docker cp <host staging>/bundle.git <container>:<workdir>/
  5. ssh -- docker exec git clone (inside container workdir)

ShipBundleParams gains:
- GatewayContainer (defaults to openclaw-gateway)
- HostStagingDir (required; scp landing pad)

ShipBundleResult.RemoteRepoPath now returns a container-local path.
RemoteBundlePath was renamed to HostStagingBundlePath since the bundle
on disk lives on the host, not the container.

7 bundle tests verify the 5-call sequence, container defaults,
field-required validations, and per-step failure modes
(container-mkdir fail, docker-cp fail, container-clone fail). The
RemoteRepoPath-is-container-path invariant is asserted directly.

multi_review_test.go mock updated to the renamed field. All existing
tests still pass.

* feat(multireview): route PreComputeDiff through the gateway container

git diff + wc now run inside the openclaw-gateway container via
ContainerExec, so the resulting diff.txt lands in the container's
overlay /tmp where reviewers can actually see it. Writing to the
host's /tmp (PR #30's approach) put the file in a filesystem the
container couldn't read.

PreComputeDiffParams gains GatewayContainer (defaults to
"openclaw-gateway"). Inner pipeline shape and wc parsing are
unchanged.

2 new tests:
- TestPreComputeDiff_RoutedThroughContainer: outer ssh contains
  'docker exec', container name, and 'base64 -d'; decoded inner
  is still 'git ... diff > diff.txt && wc -c && wc -l'
- TestPreComputeDiff_DefaultsContainerName: empty GatewayContainer
  defaults via ContainerExec

Existing tests updated: HeadRefDefaultsToHEAD now decodes the
base64 inner payload to find the quoted 'HEAD' token (the outer
ssh args no longer contain it in clear text).

Coverage 91.3% on PreComputeDiff.

* feat(multi_review): thread gateway-container through, split cleanup in two

runMultiReview now passes:
- GatewayContainer into BOTH ShipBundleParams and PreComputeDiffParams
- HostStagingDir into ShipBundleParams as /tmp/multi-review-staging-<ts>

Cleanup is now two defers (LIFO):
1. ContainerExec rm -rf <container workdir>  via containerExecFn
2. SSHRun        rm -rf <host staging dir>   via sshRunFn

Both honor --skip-cleanup. The new sshRunFn and containerExecFn
package vars let tests assert on cleanup commands deterministically.

4 new tests:
- GatewayContainerFlagsThreaded: --gateway-container value reaches both
  ShipBundle and PreComputeDiff (different default name verifies the
  thread isn't just defaulting both ends)
- HostStagingDirThreadsToShipBundle: HostStagingDir is set, has the
  /tmp/multi-review-staging- prefix, and differs from RemoteWorkdir
- BothCleanupsFire: on success, container rm -rf <workdir> and host
  rm -rf <staging> both run
- SkipCleanupSkipsBoth: --skip-cleanup suppresses both channels

Coverage on runMultiReview: 94.3%.

* docs(multi_review): document container-routed filesystem ops + CHANGELOG entry

docs/llm-support-commands.md: replaces the pre-compute-step section
with a 6-step container-routed pipeline (mkdir container + mkdir host
staging + scp + docker cp + git clone in container + diff pre-compute
via docker exec). Adds failure-semantics rows for the new container
failure modes.

CHANGELOG.md: new Unreleased fix entry under llm-support describing
the host/container filesystem boundary bug and the ContainerExec /
ShipBundle / PreComputeDiff refactor. Includes the breaking rename
of ShipBundleResult.RemoteBundlePath -> HostStagingBundlePath.

* fix(multi_review): register cleanup defers before ShipBundle

Adversarial review surfaced a pre-existing (PR #29) bug: if ShipBundle
fails partway (e.g. container mkdir succeeds, then docker cp fails),
the container workdir leaks because cleanup defers were registered
only AFTER ShipBundle returned success.

Fix: register both cleanup defers BEFORE the ShipBundle call. rm -rf
on a path that was never created is a harmless no-op, so this is safe.

New test TestMultiReview_CleanupFiresOnShipBundleFailure verifies
both cleanup channels fire even when ShipBundle returns an error.

Also cleaned up dead code in TestShipBundle_ContainerCloneFails (the
test had an unused first execCommandContext mock immediately
overwritten by a second). And tightened the --help text to mention
container routing and clarify the diff lives inside the container,
not on the remote host.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant