fix(weave): make WeaveClient.flush() non-destructive to call-pair buffers#6897
Draft
gtarpenning wants to merge 1 commit into
Draft
fix(weave): make WeaveClient.flush() non-destructive to call-pair buffers#6897gtarpenning wants to merge 1 commit into
gtarpenning wants to merge 1 commit into
Conversation
|
Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=12e4175d6173ab9d75ecc5b43c806515307a5b47 |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…fers `WeaveClient.flush()` was routing through `_flush()`, which calls `stop_accepting_new_work_and_flush_queue` + `accept_new_work` on every processor. That path is the *shutdown* path: it waits up to `FLUSH_TIMEOUT_SECONDS` (5 min) for in-flight starts/ends to pair into bulk-endpoint writes, then orphan-sends what's left and tears down the processor thread. Reasonable for `client.finish()`. Catastrophic for `client.flush()` mid-evaluation, where an open eager-start call (e.g. `Evaluation.evaluate`) means there's always something unpaired in the buffer — `flush()` stalls ~300s on every call. Add `AsyncBatchProcessor.drain_queue(timeout=60)`: waits for the send queue to drain, does not touch subclass-owned pending-pair state, does not stop the processor. Route `WeaveClient.flush()` through it via a new `_flush(for_shutdown=False)` branch. `client.finish()` keeps the existing 5-min pairing budget (preserving the bulk-endpoint perf wins on shutdown). Fixes WB-34557. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
84bd068 to
9885990
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
WeaveClient.flush()routed through_flush(), which callsstop_accepting_new_work_and_flush_queueon every processor. That path is the shutdown path: it waits up toFLUSH_TIMEOUT_SECONDS(5 min) for in-flight starts/ends to pair into bulk-endpoint writes, orphan-sends what's left, then tears the processor thread down. Right call for shutdown — wrong call for an interactiveflush()mid-evaluation, where the buffer always has an unpaired eager start (Evaluation.evaluate) andflush()stalls ~300s.AsyncBatchProcessor.drain_queue(timeout=60): waits for the send queue to drain, does not touch subclass-owned pending-pair state, does not stop the processor. The 5-minute pairing budget on shutdown is intentionally preserved — it's what lets pairs combine into the cheaper/v2/.../calls/completebulk endpoint when the process is exiting.WeaveClient.flush()through the new path via_flush(for_shutdown=False).client.finish()keeps the old behavior.Reasoning — why two paths instead of just tuning the timeout
A previous draft of this PR shrank
FLUSH_TIMEOUT_SECONDSfrom5 * 60to2.0. That was wrong: the 5-minute pairing wait isn't a bug, it's a perf feature. When the SDK is shutting down, we want to give in-flight ops as long as reasonable to finish so their start+end pair into a single/calls/completewrite (one network call, one CH insert) instead of getting orphan-sent through the eager/call/start+/call/endpair (two calls, two inserts, more overhead). Trimming that budget for shutdown would cost real perf on every well-behaved process exit.The actual bug is that
client.flush()— which fires repeatedly during normal use, mid-eval, in agent loops, etc. — was incorrectly using the shutdown path. Two distinct call sites, two distinct semantics:client.finish()client.flush()The new
drain_queueis the second column. It:flush()calls would otherwise pay the thread restart cost every time.Options considered
FLUSH_TIMEOUT_SECONDSto ~2sWhat this doesn't fix
There's still a separate, smaller bug worth chasing: in some scenarios the SDK orphan-sends "1 unpaired end" on shutdown even from a minimal
EvaluationLogger(...)+client.flush()repro. An end ending up in_pending_endswithout its start passing through_handle_startfirst suggests an ordering bug in eager-start enqueue. Separate change — this PR makes user-flush stop pessimizing on it but doesn't root-cause it.Testing
unit tests:
test_drain_queue_skips_pending_pair_waitasserts drain returns in <2s with pending items, leaves pending state intact, doesn't orphan-send, leaves the processor accepting work.test_drain_queue_processes_already_paired_itemsconfirms paired items still flow through. Existing 11 tests forCallBatchProcessorunchanged.Manually verified against qa-azure (
server-release-0.81.x) withWEAVE_USE_CALLS_COMPLETE=true: mega smokeimperative_evalwent from[PASS] 304.76s→[PASS] 4.23s. Shutdown path on the same project still hits the 5-min budget exactly as before.