fix(weave): make WeaveClient.flush() non-destructive to call-pair buffers by gtarpenning · Pull Request #6897 · wandb/weave

gtarpenning · 2026-05-20T00:15:54Z

Summary

WeaveClient.flush() routed through _flush(), which calls stop_accepting_new_work_and_flush_queue on every processor. That path is the shutdown path: it waits up to FLUSH_TIMEOUT_SECONDS (5 min) for in-flight starts/ends to pair into bulk-endpoint writes, orphan-sends what's left, then tears the processor thread down. Right call for shutdown — wrong call for an interactive flush() mid-evaluation, where the buffer always has an unpaired eager start (Evaluation.evaluate) and flush() stalls ~300s.
Adds AsyncBatchProcessor.drain_queue(timeout=60): waits for the send queue to drain, does not touch subclass-owned pending-pair state, does not stop the processor. The 5-minute pairing budget on shutdown is intentionally preserved — it's what lets pairs combine into the cheaper /v2/.../calls/complete bulk endpoint when the process is exiting.
Routes WeaveClient.flush() through the new path via _flush(for_shutdown=False). client.finish() keeps the old behavior.
Fixes WB-34557.

Reasoning — why two paths instead of just tuning the timeout

A previous draft of this PR shrank FLUSH_TIMEOUT_SECONDS from 5 * 60 to 2.0. That was wrong: the 5-minute pairing wait isn't a bug, it's a perf feature. When the SDK is shutting down, we want to give in-flight ops as long as reasonable to finish so their start+end pair into a single /calls/complete write (one network call, one CH insert) instead of getting orphan-sent through the eager /call/start + /call/end pair (two calls, two inserts, more overhead). Trimming that budget for shutdown would cost real perf on every well-behaved process exit.

The actual bug is that client.flush() — which fires repeatedly during normal use, mid-eval, in agent loops, etc. — was incorrectly using the shutdown path. Two distinct call sites, two distinct semantics:

	`client.finish()`	`client.flush()`
Intent	"I'm done. Send everything, even orphans."	"Make my already-finished writes visible. Don't disrupt in-flight calls."
Pending pairs	Wait up to 5 min, then orphan-send	Leave alone — they're still in-flight
Processor thread	Stop + restart (one-shot pattern)	Keep running
Frequency	Once per process (atexit)	Many times per process

The new drain_queue is the second column. It:

Skips the pairing wait. The pending-pair buffer represents calls whose ends haven't been enqueued yet — by definition the user can't have meant to flush them, they're not finished writes. They stay in the buffer and pair naturally when their counterpart arrives.
Doesn't stop the processor. Repeated flush() calls would otherwise pay the thread restart cost every time.
Has a generous 60s safety ceiling in case the processor thread is unresponsive, but real flushes return at batch cadence (~10ms).

Options considered

Approach	Why not
Drop `FLUSH_TIMEOUT_SECONDS` to ~2s	Breaks the shutdown bulk-endpoint optimization, the very thing the timeout exists for.
Track eager call_ids and exclude them from the shutdown pairing-wait	Helps the eager case but doesn't fix non-eager mid-flush. The fix-flush-vs-finish split makes this moot.
Skip pairing wait entirely (one path)	Equivalent to point #1 for shutdown — same regression.
Two paths (this PR)	Each call site gets the semantics it actually wants.

What this doesn't fix

There's still a separate, smaller bug worth chasing: in some scenarios the SDK orphan-sends "1 unpaired end" on shutdown even from a minimal EvaluationLogger(...) + client.flush() repro. An end ending up in _pending_ends without its start passing through _handle_start first suggests an ordering bug in eager-start enqueue. Separate change — this PR makes user-flush stop pessimizing on it but doesn't root-cause it.

Testing

unit tests: test_drain_queue_skips_pending_pair_wait asserts drain returns in <2s with pending items, leaves pending state intact, doesn't orphan-send, leaves the processor accepting work. test_drain_queue_processes_already_paired_items confirms paired items still flow through. Existing 11 tests for CallBatchProcessor unchanged.

Manually verified against qa-azure (server-release-0.81.x) with WEAVE_USE_CALLS_COMPLETE=true: mega smoke imperative_eval went from [PASS] 304.76s → [PASS] 4.23s. Shutdown path on the same project still hits the 5-min budget exactly as before.

wandbot-3000 · 2026-05-20T00:20:21Z

Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=12e4175d6173ab9d75ecc5b43c806515307a5b47

codecov · 2026-05-20T00:28:03Z

Codecov Report

❌ Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ave/trace_server_bindings/async_batch_processor.py	75.00%	0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

…fers `WeaveClient.flush()` was routing through `_flush()`, which calls `stop_accepting_new_work_and_flush_queue` + `accept_new_work` on every processor. That path is the *shutdown* path: it waits up to `FLUSH_TIMEOUT_SECONDS` (5 min) for in-flight starts/ends to pair into bulk-endpoint writes, then orphan-sends what's left and tears down the processor thread. Reasonable for `client.finish()`. Catastrophic for `client.flush()` mid-evaluation, where an open eager-start call (e.g. `Evaluation.evaluate`) means there's always something unpaired in the buffer — `flush()` stalls ~300s on every call. Add `AsyncBatchProcessor.drain_queue(timeout=60)`: waits for the send queue to drain, does not touch subclass-owned pending-pair state, does not stop the processor. Route `WeaveClient.flush()` through it via a new `_flush(for_shutdown=False)` branch. `client.finish()` keeps the existing 5-min pairing budget (preserving the bulk-endpoint perf wins on shutdown). Fixes WB-34557. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gtarpenning force-pushed the gtarpenning/fix-cc-flush-timeout branch from 84bd068 to 9885990 Compare May 20, 2026 15:47

gtarpenning changed the title ~~fix(weave): cap CallBatchProcessor flush pairing wait at 2s~~ fix(weave): make WeaveClient.flush() non-destructive to call-pair buffers May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(weave): make WeaveClient.flush() non-destructive to call-pair buffers#6897

fix(weave): make WeaveClient.flush() non-destructive to call-pair buffers#6897
gtarpenning wants to merge 1 commit into
masterfrom
gtarpenning/fix-cc-flush-timeout

gtarpenning commented May 20, 2026 •

edited by atlassian Bot

Loading

Uh oh!

wandbot-3000 Bot commented May 20, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gtarpenning commented May 20, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reasoning — why two paths instead of just tuning the timeout

Options considered

What this doesn't fix

Testing

Uh oh!

wandbot-3000 Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gtarpenning commented May 20, 2026 •

edited by atlassian Bot

Loading

wandbot-3000 Bot commented May 20, 2026 •

edited

Loading

codecov Bot commented May 20, 2026 •

edited

Loading