Skip to content

worker: remove TodoWrite tool#176

Open
anders-heimer wants to merge 1 commit into
sashiko-dev:mainfrom
anders-heimer:remove-todo-write-tool
Open

worker: remove TodoWrite tool#176
anders-heimer wants to merge 1 commit into
sashiko-dev:mainfrom
anders-heimer:remove-todo-write-tool

Conversation

@anders-heimer
Copy link
Copy Markdown
Contributor

Upstream review-prompts (vendored in third_party/prompts/) instruct the model to call TodoWrite for each analysis task, assuming Claude Code's native task-tracking tool. Sashiko's local todowrite implementation merely appended a line to a TODO.md file in the worktree and returned; it had no analytical effect on the review.

In practice this caused the model to spend a large fraction of its per-stage turn budget issuing TodoWrite calls. In one observed Stage 3 run, 27 of 46 turns were TodoWrite invocations, which pushed the stage toward the max_interactions limit and left insufficient budget for the actual callstack analysis.

Remove the TodoWrite tool entirely: the declaration in get_declarations_generic, the dispatcher entry, the todowrite method, the now-unused AsyncWriteExt import, and the companion unit test. The tool-name normalization test is updated to use a tool that still exists (list_dir). The DESIGN_REVIEW_WORKER.md references to todo_write and the Todo List state are dropped. Once the tool is no longer advertised, the upstream prompt instruction "Add each Task into a TodoWrite" becomes a harmless no-op.

Measured on a 9-entry Linux-kernel benchmark set with the Kiro CLI provider running claude-sonnet-4.5:

  • Detection rate: 4/9 (44%) -> 6/9 (67%).
  • Average turns per review: 99 -> 55.7 (-43.8%).
  • Average output tokens per review: 17306 -> 14428 (-16.6%).
  • Two previously missed patches (cifs SendReceive response-buffer leak, nci command workqueue use-after-free) are now detected correctly; the nci review dropped from 159 turns with no findings to 29 turns with a correct finding.

No regressions were observed: every patch detected before is still detected, with lower turn counts and comparable or smaller output.

@rgushchin
Copy link
Copy Markdown
Member

Thank you for the idea! I generally agree with the direction, but last time I tested it, it showed some regression. I'll run it with gemini and on a 1000 dataset - 10 is way too noisy for this.

@anders-heimer
Copy link
Copy Markdown
Contributor Author

Thank you for your quick feedback!

It will be interesting to see the results from a bigger dataset.

@anders-heimer
Copy link
Copy Markdown
Contributor Author

Hi @rgushchin ,

I have done some more benchmarking, again with the tiny benchmarking, usingQwen3-Coder-30B-A3B-Instruct-Q
5_K_M on an L40S. As you say the measurements are noisy, but the improvements makes sense theoretically at least.

Metric After (TodoWrite removed) Before
Detected 4/9 3/9
Missed 5/9 6/9
Runtime 5,430s (1.5h) 20,825s (5.8h)
Avg duration/patch 603s 2,314s
Total turns 949 1,334
Avg turns/patch 105.4 148.2
Format validation failures 343 306
Aborted reviews 24 23
Full review restarts 20 18

Br, Anders

@derekbarbosa
Copy link
Copy Markdown
Collaborator

Hey Anders,

The changes look good, and the results are promising!

You mentioned that this was done with the tiny benchmark, correct? Would you kindly elaborate?

  • How many trials of the tiny benchmark were performed (before/after)
  • Did you store the output results of the model's findings anywhere? I see detected & missed as entries there, but I am also curious if the review "output" response blabbed on about anything unrelated in either case (false positives)?

@anders-heimer
Copy link
Copy Markdown
Contributor Author

anders-heimer commented May 19, 2026

Hi Derek,

I only ran one A and B comparison per model (Sonnet and Qwen3-Coder-30B). I ran benchmarks/benchmark_tiny.json all four times.

I only had the db from the last run saved. With "unrelated" I mean the findings it didn't match the specific known bug being tested for.

Per-patch unrelated findings:

Target bug Before After
btrfs: stripe_uptodate_bitmap memleak 0 2
cgroup: NULL deref on dmem.max write 5 10
cgroup: RCU usage without read lock 5 9
nfc: workqueue UAF after rfkill 5 7
t7xx: MAX_SKB_FRAGS overflow 3 15
smb: response buffer memleak 1 2
smb: SendReceive() buffer leak 0 0
nvme: swapped sgl parameters 0 0
flex_proportions: deadlock 0 0

The before findings are fewer but more specific. Example from t7xx (3 findings total):
"In t7xx_dpmaif_rx_buf_alloc(), the error path at line 264 uses while (--i > 0) which skips unmapping index 0..."

The after findings are more numerous but vaguer. Example from t7xx (15 findings total):
"Hardware register access without power domain/clk checks"
"Potential integer overflow in ring buffer calculations"

It looks to me that removing TodoWrite freed up turn budget by not wasting it on no-ops, but in my run the model used that budget to produce more findings rather than deeper analysis.

Br, Anders Heimer

@derekbarbosa
Copy link
Copy Markdown
Collaborator

derekbarbosa commented May 20, 2026

thank you for the expanded findings!

The before findings are fewer but more specific. Example from t7xx (3 findings total):
"In t7xx_dpmaif_rx_buf_alloc(), the error path at line 264 uses while (--i > 0) which skips unmapping index 0..."

The after findings are more numerous but vaguer. Example from t7xx (15 findings total):
"Hardware register access without power domain/clk checks"
"Potential integer overflow in ring buffer calculations"

this is interesting behavior, if I had to guess, the TodoWrite specified what the finding was, and then spent time aggregating/validating them?

Do you have any access to the logs, if they exist? (example on prod instance via the "View Raw Log" button: https://sashiko.dev/#/log/54060)

It looks to me that removing TodoWrite freed up turn budget by not wasting it on no-ops, but in my run the model used that budget to produce more findings rather than deeper analysis.

Now I'm curious -- if adding a shim to prompts.rs, preferably one of the review stages (say stages 8/9/10) may "massage" out more precise findings. Something akin to "please be specific: elaborate what functions and what line numbers trigger the error"

Upstream review-prompts (vendored in third_party/prompts/) instruct the
model to call TodoWrite for each analysis task, assuming Claude Code's
native task-tracking tool. Sashiko's local todowrite implementation
merely appended a line to a TODO.md file in the worktree and returned;
it had no analytical effect on the review.

In practice this caused the model to spend a large fraction of its
per-stage turn budget issuing TodoWrite calls. In one observed Stage 3
run, 27 of 46 turns were TodoWrite invocations, which pushed the stage
toward the max_interactions limit and left insufficient budget for the
actual callstack analysis.

Remove the TodoWrite tool entirely: the declaration in
get_declarations_generic, the dispatcher entry, the todowrite method,
the now-unused AsyncWriteExt import, and the companion unit test. The
tool-name normalization test is updated to use a tool that still
exists (list_dir). The DESIGN_REVIEW_WORKER.md references to
todo_write and the Todo List state are dropped. Once the tool is no
longer advertised, the upstream prompt instruction "Add each Task
into a TodoWrite" becomes a harmless no-op.

Measured on a 9-entry Linux-kernel benchmark set with the Kiro CLI
provider running claude-sonnet-4.5:

- Detection rate: 4/9 (44%) -> 6/9 (67%).
- Average turns per review: 99 -> 55.7 (-43.8%).
- Average output tokens per review: 17306 -> 14428 (-16.6%).
- Two previously missed patches (cifs SendReceive response-buffer
  leak, nci command workqueue use-after-free) are now detected
  correctly; the nci review dropped from 159 turns with no findings
  to 29 turns with a correct finding.

No regressions were observed: every patch detected before is still
detected, with lower turn counts and comparable or smaller output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>
@anders-heimer anders-heimer force-pushed the remove-todo-write-tool branch from 1b67fa5 to 052b1dc Compare May 21, 2026 04:05
@anders-heimer
Copy link
Copy Markdown
Contributor Author

Thanks Derek, I had missed that aspect.

Unfortunately, the initial logs from the runs I mentioned are gone. I used Kiro to collect whatever information I still had lying around, hence the odd format. Those notes are attached, I think it shows what you are after.

From the logs, it does not look like TodoWrite was discovering bugs directly. Most TodoWrite calls were checklist/status churn. A few calls did contain concrete candidate-like text, but that only seems useful within the same stage unless the model also emits the candidate as a structured concern. As I understand it, later stages do not consume TODO.md as persistent review state.

I tested your theory by preserving the useful part in the finding pipeline by emitting concrete suspected bugs with structured location data, for later stages. In the tiny benchmark I ran with this change, that did preserve structured locations for the final findings.

The downside is that the richer structured output appears to put more pressure on the JSON path, at least with Kiro. In my partial run it caused more malformed-output/schema failures and timeout churn. I need to look in to that more in detail.
review-7-patchset-1.html
review-2-patchset-5.html
review-8-patchset-7.html
review-1-patchset-3.html
review-5-patchset-8.html
review-6-patchset-4.html
review-3-patchset-6.html
review-4-patchset-2.html
review-1-patchset-1.html
review-2-patchset-2.html
review-5-patchset-1.html
review-6-patchset-6.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants