Skip to content

fix(weave): make WeaveClient.flush() non-destructive to call-pair buffers#6897

Draft
gtarpenning wants to merge 1 commit into
masterfrom
gtarpenning/fix-cc-flush-timeout
Draft

fix(weave): make WeaveClient.flush() non-destructive to call-pair buffers#6897
gtarpenning wants to merge 1 commit into
masterfrom
gtarpenning/fix-cc-flush-timeout

Conversation

@gtarpenning
Copy link
Copy Markdown
Member

@gtarpenning gtarpenning commented May 20, 2026

Summary

  • WeaveClient.flush() routed through _flush(), which calls stop_accepting_new_work_and_flush_queue on every processor. That path is the shutdown path: it waits up to FLUSH_TIMEOUT_SECONDS (5 min) for in-flight starts/ends to pair into bulk-endpoint writes, orphan-sends what's left, then tears the processor thread down. Right call for shutdown — wrong call for an interactive flush() mid-evaluation, where the buffer always has an unpaired eager start (Evaluation.evaluate) and flush() stalls ~300s.
  • Adds AsyncBatchProcessor.drain_queue(timeout=60): waits for the send queue to drain, does not touch subclass-owned pending-pair state, does not stop the processor. The 5-minute pairing budget on shutdown is intentionally preserved — it's what lets pairs combine into the cheaper /v2/.../calls/complete bulk endpoint when the process is exiting.
  • Routes WeaveClient.flush() through the new path via _flush(for_shutdown=False). client.finish() keeps the old behavior.
  • Fixes WB-34557.

Reasoning — why two paths instead of just tuning the timeout

A previous draft of this PR shrank FLUSH_TIMEOUT_SECONDS from 5 * 60 to 2.0. That was wrong: the 5-minute pairing wait isn't a bug, it's a perf feature. When the SDK is shutting down, we want to give in-flight ops as long as reasonable to finish so their start+end pair into a single /calls/complete write (one network call, one CH insert) instead of getting orphan-sent through the eager /call/start + /call/end pair (two calls, two inserts, more overhead). Trimming that budget for shutdown would cost real perf on every well-behaved process exit.

The actual bug is that client.flush() — which fires repeatedly during normal use, mid-eval, in agent loops, etc. — was incorrectly using the shutdown path. Two distinct call sites, two distinct semantics:

client.finish() client.flush()
Intent "I'm done. Send everything, even orphans." "Make my already-finished writes visible. Don't disrupt in-flight calls."
Pending pairs Wait up to 5 min, then orphan-send Leave alone — they're still in-flight
Processor thread Stop + restart (one-shot pattern) Keep running
Frequency Once per process (atexit) Many times per process

The new drain_queue is the second column. It:

  1. Skips the pairing wait. The pending-pair buffer represents calls whose ends haven't been enqueued yet — by definition the user can't have meant to flush them, they're not finished writes. They stay in the buffer and pair naturally when their counterpart arrives.
  2. Doesn't stop the processor. Repeated flush() calls would otherwise pay the thread restart cost every time.
  3. Has a generous 60s safety ceiling in case the processor thread is unresponsive, but real flushes return at batch cadence (~10ms).

Options considered

Approach Why not
Drop FLUSH_TIMEOUT_SECONDS to ~2s Breaks the shutdown bulk-endpoint optimization, the very thing the timeout exists for.
Track eager call_ids and exclude them from the shutdown pairing-wait Helps the eager case but doesn't fix non-eager mid-flush. The fix-flush-vs-finish split makes this moot.
Skip pairing wait entirely (one path) Equivalent to point #1 for shutdown — same regression.
Two paths (this PR) Each call site gets the semantics it actually wants.

What this doesn't fix

There's still a separate, smaller bug worth chasing: in some scenarios the SDK orphan-sends "1 unpaired end" on shutdown even from a minimal EvaluationLogger(...) + client.flush() repro. An end ending up in _pending_ends without its start passing through _handle_start first suggests an ordering bug in eager-start enqueue. Separate change — this PR makes user-flush stop pessimizing on it but doesn't root-cause it.

Testing

unit tests: test_drain_queue_skips_pending_pair_wait asserts drain returns in <2s with pending items, leaves pending state intact, doesn't orphan-send, leaves the processor accepting work. test_drain_queue_processes_already_paired_items confirms paired items still flow through. Existing 11 tests for CallBatchProcessor unchanged.

Manually verified against qa-azure (server-release-0.81.x) with WEAVE_USE_CALLS_COMPLETE=true: mega smoke imperative_eval went from [PASS] 304.76s[PASS] 4.23s. Shutdown path on the same project still hits the 5-min budget exactly as before.

@wandbot-3000
Copy link
Copy Markdown

wandbot-3000 Bot commented May 20, 2026

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ave/trace_server_bindings/async_batch_processor.py 75.00% 0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

…fers

`WeaveClient.flush()` was routing through `_flush()`, which calls
`stop_accepting_new_work_and_flush_queue` + `accept_new_work` on every
processor. That path is the *shutdown* path: it waits up to
`FLUSH_TIMEOUT_SECONDS` (5 min) for in-flight starts/ends to pair into
bulk-endpoint writes, then orphan-sends what's left and tears down the
processor thread. Reasonable for `client.finish()`. Catastrophic for
`client.flush()` mid-evaluation, where an open eager-start call (e.g.
`Evaluation.evaluate`) means there's always something unpaired in the
buffer — `flush()` stalls ~300s on every call.

Add `AsyncBatchProcessor.drain_queue(timeout=60)`: waits for the send
queue to drain, does not touch subclass-owned pending-pair state, does
not stop the processor. Route `WeaveClient.flush()` through it via a
new `_flush(for_shutdown=False)` branch. `client.finish()` keeps the
existing 5-min pairing budget (preserving the bulk-endpoint perf wins
on shutdown).

Fixes WB-34557.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gtarpenning gtarpenning force-pushed the gtarpenning/fix-cc-flush-timeout branch from 84bd068 to 9885990 Compare May 20, 2026 15:47
@gtarpenning gtarpenning changed the title fix(weave): cap CallBatchProcessor flush pairing wait at 2s fix(weave): make WeaveClient.flush() non-destructive to call-pair buffers May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant