worker: remove TodoWrite tool#176
Conversation
|
Thank you for the idea! I generally agree with the direction, but last time I tested it, it showed some regression. I'll run it with gemini and on a 1000 dataset - 10 is way too noisy for this. |
|
Thank you for your quick feedback! It will be interesting to see the results from a bigger dataset. |
|
Hi @rgushchin , I have done some more benchmarking, again with the tiny benchmarking, usingQwen3-Coder-30B-A3B-Instruct-Q
Br, Anders |
|
Hey Anders, The changes look good, and the results are promising! You mentioned that this was done with the tiny benchmark, correct? Would you kindly elaborate?
|
|
Hi Derek, I only ran one A and B comparison per model (Sonnet and Qwen3-Coder-30B). I ran benchmarks/benchmark_tiny.json all four times. I only had the db from the last run saved. With "unrelated" I mean the findings it didn't match the specific known bug being tested for. Per-patch unrelated findings:
The before findings are fewer but more specific. Example from t7xx (3 findings total): The after findings are more numerous but vaguer. Example from t7xx (15 findings total): It looks to me that removing TodoWrite freed up turn budget by not wasting it on no-ops, but in my run the model used that budget to produce more findings rather than deeper analysis. Br, Anders Heimer |
|
thank you for the expanded findings!
this is interesting behavior, if I had to guess, the TodoWrite specified what the finding was, and then spent time aggregating/validating them? Do you have any access to the logs, if they exist? (example on prod instance via the "View Raw Log" button: https://sashiko.dev/#/log/54060)
Now I'm curious -- if adding a shim to |
Upstream review-prompts (vendored in third_party/prompts/) instruct the model to call TodoWrite for each analysis task, assuming Claude Code's native task-tracking tool. Sashiko's local todowrite implementation merely appended a line to a TODO.md file in the worktree and returned; it had no analytical effect on the review. In practice this caused the model to spend a large fraction of its per-stage turn budget issuing TodoWrite calls. In one observed Stage 3 run, 27 of 46 turns were TodoWrite invocations, which pushed the stage toward the max_interactions limit and left insufficient budget for the actual callstack analysis. Remove the TodoWrite tool entirely: the declaration in get_declarations_generic, the dispatcher entry, the todowrite method, the now-unused AsyncWriteExt import, and the companion unit test. The tool-name normalization test is updated to use a tool that still exists (list_dir). The DESIGN_REVIEW_WORKER.md references to todo_write and the Todo List state are dropped. Once the tool is no longer advertised, the upstream prompt instruction "Add each Task into a TodoWrite" becomes a harmless no-op. Measured on a 9-entry Linux-kernel benchmark set with the Kiro CLI provider running claude-sonnet-4.5: - Detection rate: 4/9 (44%) -> 6/9 (67%). - Average turns per review: 99 -> 55.7 (-43.8%). - Average output tokens per review: 17306 -> 14428 (-16.6%). - Two previously missed patches (cifs SendReceive response-buffer leak, nci command workqueue use-after-free) are now detected correctly; the nci review dropped from 159 turns with no findings to 29 turns with a correct finding. No regressions were observed: every patch detected before is still detected, with lower turn counts and comparable or smaller output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Anders Heimer <anders.heimer@est.tech>
1b67fa5 to
052b1dc
Compare
|
Thanks Derek, I had missed that aspect. Unfortunately, the initial logs from the runs I mentioned are gone. I used Kiro to collect whatever information I still had lying around, hence the odd format. Those notes are attached, I think it shows what you are after. From the logs, it does not look like TodoWrite was discovering bugs directly. Most TodoWrite calls were checklist/status churn. A few calls did contain concrete candidate-like text, but that only seems useful within the same stage unless the model also emits the candidate as a structured concern. As I understand it, later stages do not consume TODO.md as persistent review state. I tested your theory by preserving the useful part in the finding pipeline by emitting concrete suspected bugs with structured location data, for later stages. In the tiny benchmark I ran with this change, that did preserve structured locations for the final findings. The downside is that the richer structured output appears to put more pressure on the JSON path, at least with Kiro. In my partial run it caused more malformed-output/schema failures and timeout churn. I need to look in to that more in detail. |
Upstream review-prompts (vendored in third_party/prompts/) instruct the model to call TodoWrite for each analysis task, assuming Claude Code's native task-tracking tool. Sashiko's local todowrite implementation merely appended a line to a TODO.md file in the worktree and returned; it had no analytical effect on the review.
In practice this caused the model to spend a large fraction of its per-stage turn budget issuing TodoWrite calls. In one observed Stage 3 run, 27 of 46 turns were TodoWrite invocations, which pushed the stage toward the max_interactions limit and left insufficient budget for the actual callstack analysis.
Remove the TodoWrite tool entirely: the declaration in get_declarations_generic, the dispatcher entry, the todowrite method, the now-unused AsyncWriteExt import, and the companion unit test. The tool-name normalization test is updated to use a tool that still exists (list_dir). The DESIGN_REVIEW_WORKER.md references to todo_write and the Todo List state are dropped. Once the tool is no longer advertised, the upstream prompt instruction "Add each Task into a TodoWrite" becomes a harmless no-op.
Measured on a 9-entry Linux-kernel benchmark set with the Kiro CLI provider running claude-sonnet-4.5:
No regressions were observed: every patch detected before is still detected, with lower turn counts and comparable or smaller output.