worker: remove TodoWrite tool by anders-heimer · Pull Request #176 · sashiko-dev/sashiko

anders-heimer · 2026-05-11T17:57:26Z

Upstream review-prompts (vendored in third_party/prompts/) instruct the model to call TodoWrite for each analysis task, assuming Claude Code's native task-tracking tool. Sashiko's local todowrite implementation merely appended a line to a TODO.md file in the worktree and returned; it had no analytical effect on the review.

In practice this caused the model to spend a large fraction of its per-stage turn budget issuing TodoWrite calls. In one observed Stage 3 run, 27 of 46 turns were TodoWrite invocations, which pushed the stage toward the max_interactions limit and left insufficient budget for the actual callstack analysis.

Remove the TodoWrite tool entirely: the declaration in get_declarations_generic, the dispatcher entry, the todowrite method, the now-unused AsyncWriteExt import, and the companion unit test. The tool-name normalization test is updated to use a tool that still exists (list_dir). The DESIGN_REVIEW_WORKER.md references to todo_write and the Todo List state are dropped. Once the tool is no longer advertised, the upstream prompt instruction "Add each Task into a TodoWrite" becomes a harmless no-op.

Measured on a 9-entry Linux-kernel benchmark set with the Kiro CLI provider running claude-sonnet-4.5:

Detection rate: 4/9 (44%) -> 6/9 (67%).
Average turns per review: 99 -> 55.7 (-43.8%).
Average output tokens per review: 17306 -> 14428 (-16.6%).
Two previously missed patches (cifs SendReceive response-buffer leak, nci command workqueue use-after-free) are now detected correctly; the nci review dropped from 159 turns with no findings to 29 turns with a correct finding.

No regressions were observed: every patch detected before is still detected, with lower turn counts and comparable or smaller output.

rgushchin · 2026-05-11T19:18:52Z

Thank you for the idea! I generally agree with the direction, but last time I tested it, it showed some regression. I'll run it with gemini and on a 1000 dataset - 10 is way too noisy for this.

anders-heimer · 2026-05-12T12:25:41Z

Thank you for your quick feedback!

It will be interesting to see the results from a bigger dataset.

anders-heimer · 2026-05-19T13:41:12Z

Hi @rgushchin ,

I have done some more benchmarking, again with the tiny benchmarking, usingQwen3-Coder-30B-A3B-Instruct-Q
5_K_M on an L40S. As you say the measurements are noisy, but the improvements makes sense theoretically at least.

Metric	After (TodoWrite removed)	Before
Detected	4/9	3/9
Missed	5/9	6/9
Runtime	5,430s (1.5h)	20,825s (5.8h)
Avg duration/patch	603s	2,314s
Total turns	949	1,334
Avg turns/patch	105.4	148.2
Format validation failures	343	306
Aborted reviews	24	23
Full review restarts	20	18

Br, Anders

derekbarbosa · 2026-05-19T17:43:01Z

Hey Anders,

The changes look good, and the results are promising!

You mentioned that this was done with the tiny benchmark, correct? Would you kindly elaborate?

How many trials of the tiny benchmark were performed (before/after)
Did you store the output results of the model's findings anywhere? I see detected & missed as entries there, but I am also curious if the review "output" response blabbed on about anything unrelated in either case (false positives)?

anders-heimer · 2026-05-19T18:41:40Z

Hi Derek,

I only ran one A and B comparison per model (Sonnet and Qwen3-Coder-30B). I ran benchmarks/benchmark_tiny.json all four times.

I only had the db from the last run saved. With "unrelated" I mean the findings it didn't match the specific known bug being tested for.

Per-patch unrelated findings:

Target bug	Before	After
btrfs: stripe_uptodate_bitmap memleak	0	2
cgroup: NULL deref on dmem.max write	5	10
cgroup: RCU usage without read lock	5	9
nfc: workqueue UAF after rfkill	5	7
t7xx: MAX_SKB_FRAGS overflow	3	15
smb: response buffer memleak	1	2
smb: SendReceive() buffer leak	0	0
nvme: swapped sgl parameters	0	0
flex_proportions: deadlock	0	0

The before findings are fewer but more specific. Example from t7xx (3 findings total):
"In t7xx_dpmaif_rx_buf_alloc(), the error path at line 264 uses while (--i > 0) which skips unmapping index 0..."

The after findings are more numerous but vaguer. Example from t7xx (15 findings total):
"Hardware register access without power domain/clk checks"
"Potential integer overflow in ring buffer calculations"

It looks to me that removing TodoWrite freed up turn budget by not wasting it on no-ops, but in my run the model used that budget to produce more findings rather than deeper analysis.

Br, Anders Heimer

derekbarbosa · 2026-05-20T12:13:59Z

thank you for the expanded findings!

The before findings are fewer but more specific. Example from t7xx (3 findings total):
"In t7xx_dpmaif_rx_buf_alloc(), the error path at line 264 uses while (--i > 0) which skips unmapping index 0..."

The after findings are more numerous but vaguer. Example from t7xx (15 findings total):
"Hardware register access without power domain/clk checks"
"Potential integer overflow in ring buffer calculations"

this is interesting behavior, if I had to guess, the TodoWrite specified what the finding was, and then spent time aggregating/validating them?

Do you have any access to the logs, if they exist? (example on prod instance via the "View Raw Log" button: https://sashiko.dev/#/log/54060)

It looks to me that removing TodoWrite freed up turn budget by not wasting it on no-ops, but in my run the model used that budget to produce more findings rather than deeper analysis.

Now I'm curious -- if adding a shim to prompts.rs, preferably one of the review stages (say stages 8/9/10) may "massage" out more precise findings. Something akin to "please be specific: elaborate what functions and what line numbers trigger the error"

Upstream review-prompts (vendored in third_party/prompts/) instruct the model to call TodoWrite for each analysis task, assuming Claude Code's native task-tracking tool. Sashiko's local todowrite implementation merely appended a line to a TODO.md file in the worktree and returned; it had no analytical effect on the review. In practice this caused the model to spend a large fraction of its per-stage turn budget issuing TodoWrite calls. In one observed Stage 3 run, 27 of 46 turns were TodoWrite invocations, which pushed the stage toward the max_interactions limit and left insufficient budget for the actual callstack analysis. Remove the TodoWrite tool entirely: the declaration in get_declarations_generic, the dispatcher entry, the todowrite method, the now-unused AsyncWriteExt import, and the companion unit test. The tool-name normalization test is updated to use a tool that still exists (list_dir). The DESIGN_REVIEW_WORKER.md references to todo_write and the Todo List state are dropped. Once the tool is no longer advertised, the upstream prompt instruction "Add each Task into a TodoWrite" becomes a harmless no-op. Measured on a 9-entry Linux-kernel benchmark set with the Kiro CLI provider running claude-sonnet-4.5: - Detection rate: 4/9 (44%) -> 6/9 (67%). - Average turns per review: 99 -> 55.7 (-43.8%). - Average output tokens per review: 17306 -> 14428 (-16.6%). - Two previously missed patches (cifs SendReceive response-buffer leak, nci command workqueue use-after-free) are now detected correctly; the nci review dropped from 159 turns with no findings to 29 turns with a correct finding. No regressions were observed: every patch detected before is still detected, with lower turn counts and comparable or smaller output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Anders Heimer <anders.heimer@est.tech>

anders-heimer · 2026-05-21T13:13:46Z

Thanks Derek, I had missed that aspect.

Unfortunately, the initial logs from the runs I mentioned are gone. I used Kiro to collect whatever information I still had lying around, hence the odd format. Those notes are attached, I think it shows what you are after.

From the logs, it does not look like TodoWrite was discovering bugs directly. Most TodoWrite calls were checklist/status churn. A few calls did contain concrete candidate-like text, but that only seems useful within the same stage unless the model also emits the candidate as a structured concern. As I understand it, later stages do not consume TODO.md as persistent review state.

I tested your theory by preserving the useful part in the finding pipeline by emitting concrete suspected bugs with structured location data, for later stages. In the tiny benchmark I ran with this change, that did preserve structured locations for the final findings.

The downside is that the richer structured output appears to put more pressure on the JSON path, at least with Kiro. In my partial run it caused more malformed-output/schema failures and timeout churn. I need to look in to that more in detail.
review-7-patchset-1.html
review-2-patchset-5.html
review-8-patchset-7.html
review-1-patchset-3.html
review-5-patchset-8.html
review-6-patchset-4.html
review-3-patchset-6.html
review-4-patchset-2.html
review-1-patchset-1.html
review-2-patchset-2.html
review-5-patchset-1.html
review-6-patchset-6.html

anders-heimer mentioned this pull request May 19, 2026

Required optimization on local review with claude cli #184

Open

anders-heimer force-pushed the remove-todo-write-tool branch from 1b67fa5 to 052b1dc Compare May 21, 2026 04:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

worker: remove TodoWrite tool#176

worker: remove TodoWrite tool#176
anders-heimer wants to merge 1 commit into
sashiko-dev:mainfrom
anders-heimer:remove-todo-write-tool

anders-heimer commented May 11, 2026

Uh oh!

rgushchin commented May 11, 2026

Uh oh!

anders-heimer commented May 12, 2026

Uh oh!

anders-heimer commented May 19, 2026

Uh oh!

derekbarbosa commented May 19, 2026

Uh oh!

anders-heimer commented May 19, 2026 •

edited

Loading

Uh oh!

derekbarbosa commented May 20, 2026 •

edited

Loading

Uh oh!

anders-heimer commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

anders-heimer commented May 11, 2026

Uh oh!

rgushchin commented May 11, 2026

Uh oh!

anders-heimer commented May 12, 2026

Uh oh!

anders-heimer commented May 19, 2026

Uh oh!

derekbarbosa commented May 19, 2026

Uh oh!

anders-heimer commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

derekbarbosa commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anders-heimer commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

anders-heimer commented May 19, 2026 •

edited

Loading

derekbarbosa commented May 20, 2026 •

edited

Loading