Skip to content

fix(multi_review): directive large-diff workflow + aborted-reviewer status#34

Merged
samestrin merged 2 commits into
mainfrom
fix/reviewer-aborted-status
May 15, 2026
Merged

fix(multi_review): directive large-diff workflow + aborted-reviewer status#34
samestrin merged 2 commits into
mainfrom
fix/reviewer-aborted-status

Conversation

@samestrin
Copy link
Copy Markdown
Owner

Summary

Production run surfaced a 1.2 MB diff where kai (kimi-k2.6-coding) made 148 tool calls and was aborted by the openclaw harness before producing a review. Two problems:

  1. Kai infinite-explored. The soft "if your context budget is tight" hint in the task message wasn't strong enough to redirect a reasoning model away from exhaustive exploration.
  2. The failure was invisible in the summary. Kai's response carried Aborted=true, but the summary reported status: ok with tdLineCount: 0 — indistinguishable from a reviewer that ran cleanly and found nothing.

Changes

1. Directive large-diff workflow (buildDefaultTaskMessage)

For diffs >1MB, the task message now contains a REQUIRED workflow:

  • (a) Run git diff --stat first, pick at most 10 files (largest + security-sensitive)
  • (b) For picked files only, run git diff <base>..<head> -- <path> — do NOT cat the full diff.txt
  • (c) Stop exploring after inspecting picked files; write the review
  • (d) Hard tool-call budget: aim for 15, ceiling at 20 — if you're past 20 without findings you're over-exploring

2. Aborted-reviewer status reporting (runMultiReview)

invokeOne now checks res.Aborted after WriteReviewerOutput and sets status: "aborted" with an explanatory error message. Aborted reviewers are excluded from okCount and from the merge step, so:

  • All reviewers aborted → exit non-zero (treated like all-fail)
  • Mixed pool with some aborted → partial: true, merged stream excludes aborted agents' empty output
  • Summary clearly distinguishes the two failure modes

ReviewerStatus.Status doc comment updated: "ok" | "failed" | "aborted" | "skipped".

Test plan

  • go test ./... passes; commands package coverage 80.7%
  • New TestMultiReview_AbortedReviewerReportsAbortedStatus — sole aborted reviewer → status="aborted", run exits non-zero
  • New TestMultiReview_AbortedReviewerCountedAsNotOk — mixed pool sets partial:true; aborted agent excluded from merge
  • Updated TestMultiReview_LargeDiffWarning — asserts REQUIRED wording, tool-call budget, "Stop exploring" directive
  • Existing happy-path tests still pass (mockResultFor doesn't set Aborted)

Adversarial review

  • okCount==0 triggers existing "all reviewers failed" error path
  • partial calculation okCount > 0 && okCount < len(allReviewers) still works — aborted counts as not-ok
  • ✅ Backward compat for any external consumer of summary JSON: existing "failed" and "ok" values unchanged; only adds a new value

samestrin added 2 commits May 15, 2026 12:28
…rted status

Two changes:

1. Strengthened the large-diff (>1MB) workflow in buildDefaultTaskMessage
   from a soft \"if your context budget is tight\" suggestion into a
   directive REQUIRED workflow:
   - Step a: 'git diff --stat' first, pick at most 10 files
   - Step b: per-file diff for picked files only, not the whole diff.txt
   - Step c: stop exploring, write findings
   - Step d: 15-tool-call budget, hard cap at 20
   Updates the LargeDiffWarning test to assert the new wording.

   Production observation: kai (kimi-k2.6-coding) burned its 30-min
   harness budget on a 1.2 MB diff making 148 tool calls and never
   producing findings. The soft hint wasn't strong enough to redirect
   reasoning models away from exhaustive exploration.

2. Added RED tests for the upcoming status-reporting fix:
   - TestMultiReview_AbortedReviewerReportsAbortedStatus: aborted sole
     reviewer should report status: \"aborted\", and the run should
     exit non-zero (treated like all-fail).
   - TestMultiReview_AbortedReviewerCountedAsNotOk: in a mixed pool,
     aborted reviewers don't count toward okCount, so partial:true
     should fire and the merged stream excludes their (empty) output.

   Both fail against current impl. GREEN follows.
When openclaw aborts an agent before it produces a review (tool-call
ceiling, per-turn time limit, etc.), the InvokeReviewerResult carries
Aborted=true but a defined Status field (\"ok\"). Previously runMultiReview
ignored Aborted and reported status: \"ok\" with 0 findings —
indistinguishable from a reviewer that ran cleanly and just didn't find
anything.

New behavior: invokeOne checks res.Aborted after WriteReviewerOutput and
sets status=\"aborted\" with an explanatory error message. Aborted
reviewers are excluded from okCount and from successAgents (so merged
stream skips their empty output). If all reviewers abort, the run exits
non-zero — treated like all-fail.

Production trigger: kai (kimi-k2.6-coding) made 148 tool calls on a
1.2 MB diff and was aborted by the openclaw harness before producing a
review. The summary said 'status: ok, 0 findings' which masked the
actual failure mode.

Adds CHANGELOG entry covering this fix plus the companion large-diff
workflow change shipped in the previous commit.
@samestrin samestrin merged commit 8108446 into main May 15, 2026
8 checks passed
@samestrin samestrin deleted the fix/reviewer-aborted-status branch May 15, 2026 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant