Fix RemoteConversation FINISHED stop-hook race#3191
Conversation
The WS-driven termination path in `RemoteConversation._wait_for_run_completion` returned from `run()` on the first WebSocket terminal status. That raced the server-side stop-hook eval: when an `agent.step()` set `execution_status = FINISHED` and the lock was released at the end of one run-loop iteration, the next iteration could still flip status back to RUNNING via a stop hook that denied stopping (`hook_processor.run_stop()` returning rc=2). Clients observing the intermediate FINISHED returned, which then tore down the agent-server pod via the benchmark's `workspace_keepalive` exit. The agent never got to consume its iteration budget when hooks blocked. Empirical evidence (programbench retry-16, run 25497962453): - conv `dd86d184` (zip-password-finder): hook block at 13:30:10.936, agent did just 1 retry action (file_editor view) before pod teardown — far short of `max_iteration_per_run=200`. - conv `e4b6892e` (zoxide): hook block at 13:33:52.082, 0 retry actions — conversation died on the status revert event itself. - in-memory `history` snapshots ended at the FINISHED state update; on-disk `/workspace/conversations/.../events` had 5 extra events captured by the persist callback during the brief post-`run()` window. Fix: WS terminal status is a fast wakeup hint only — except for ERROR/STUCK, which are not subject to hook reverts and still terminate immediately. For FINISHED we fall through to the existing REST-poll path, which already requires `TERMINAL_POLL_THRESHOLD` consecutive terminal polls before returning. Any FINISHED→RUNNING revert seen on a REST poll naturally resets the consecutive-terminal counter, and we drain the WS hint queue at that point so we don't busy-spin on stale FINISHED notifications. Latency cost on the happy path (no hook block, default `poll_interval=1.0s`, `TERMINAL_POLL_THRESHOLD=3`): ~2s additional vs. the previous immediate- return on WS terminal — acceptable for benchmark and CLI workloads. Adds two regression tests: - `test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal`: seeds the WS queue with FINISHED while REST shows the post-hook RUNNING revert; asserts `run()` polls past the race window before confirming. - `test_remote_conversation_run_ws_error_still_terminates_immediately`: ensures the WS fast-path is preserved for ERROR (not subject to hook reverts). Refactors the queue-drain logic into `_drain_terminal_status_queue` and isolates the WS-immediate-terminal predicate into `_immediate_terminal_statuses`, both with docstrings explaining why FINISHED is excluded. Co-authored-by: openhands <openhands@all-hands.dev>
Persistent note for future sessions / agents debugging this surface so they don't have to re-derive the diagnosis from output.jsonl artifacts. Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
||||||||||||||||||||
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste
This is an excellent fix for a subtle but critical race condition. The implementation is clean, well-documented, and properly tested.
Code Quality Highlights:
- Clear refactoring with
_drain_terminal_status_queue()and_immediate_terminal_statuses() - Excellent inline documentation explaining the WebSocket hint vs REST polling pattern
- Strong regression tests that directly reproduce the race scenario
- Proper handling of ERROR/STUCK as immediate terminal states
Process Note: This PR changes conversation termination behavior (agent behavior that could affect benchmarks). While the PR includes GitHub Actions links to successful ProgramBench runs, the custom review guidelines require either:
- Links to the eval monitor (
openhands-eval-monitor.vercel.app) showing completed benchmarks, and - Human maintainer confirmation of the eval results
Since these are GitHub Actions links (not eval monitor) and no human has confirmed the results in the PR, I'm flagging this for human review per the eval-risk policy.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
This is a correctness fix for a real production bug discovered in ProgramBench retry-16. The change makes the system more correct by treating WebSocket FINISHED events as wakeup hints rather than authoritative termination signals. This prevents premature client teardown when stop hooks veto completion.
Risk factors:
- ✅ Well-tested with regression coverage for both the race scenario and ERROR fast-path preservation
- ✅ Changes isolated to RemoteConversation termination logic
- ✅ Solves a real bug (not theoretical)
- ✅ Evidence of successful ProgramBench runs provided
⚠️ Could add minor latency to FINISHED termination (now requires REST polls), but this is the correct behavior
VERDICT:
✅ Recommend merging after human review: Core fix is sound and necessary for correct stop-hook behavior. The eval evidence shows ProgramBench now works correctly with this change.
KEY INSIGHT:
The separation of "immediate terminal statuses" (ERROR/STUCK) from "polled terminal statuses" (FINISHED) elegantly captures the semantic difference between non-revertable errors and potentially-vetoed agent completion.
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Verified the RemoteConversation FINISHED stop-hook race fix works correctly by exercising the changed code and analyzing the implementation.
Does this PR achieve its stated goal?
Yes. This PR successfully fixes the race condition where WebSocket FINISHED events could cause RemoteConversation to terminate prematurely before server-side stop hooks finished evaluating whether the conversation should actually stop.
The fix correctly implements the behavior described in the PR: WebSocket FINISHED is now treated as a wake-up hint rather than authoritative termination, while ERROR and STUCK remain immediate terminal states. REST polling with a threshold of 3 consecutive terminal polls provides authoritative confirmation before the client considers the run complete.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Dependencies installed, project builds successfully |
| CI Status | ✅ All critical checks passing (sdk-tests, agent-server-tests, windows-tests, cross-tests) |
| Functional Verification | ✅ Code behavior matches specification, regression tests comprehensive |
Functional Verification
Verification Approach
Created a verification script (qa_simple_verification.py) to exercise the key aspects of the fix:
Test 1: FINISHED is a hint, not authoritative termination
Verification:
uv run python qa_simple_verification.pyResult:
Test 1: FINISHED via WebSocket is a hint, not authoritative termination
----------------------------------------------------------------------
Immediate terminal statuses: frozenset({'error', 'stuck'})
✅ PASS: FINISHED is NOT in immediate terminal statuses
✅ PASS: ERROR and STUCK are immediate terminal statuses
Testing _drain_terminal_status_queue behavior...
Queue size before drain: 3
Queue size after drain: 0
✅ PASS: Queue draining correctly empties all items
What this proves: The new _immediate_terminal_statuses() method correctly excludes FINISHED from immediate termination. Only ERROR and STUCK trigger the fast WebSocket termination path. The queue draining logic prevents stale FINISHED events from causing busy-spin loops.
Test 2: ERROR and STUCK remain immediate terminal states
Result:
Test 2: ERROR and STUCK terminate immediately (not hints)
----------------------------------------------------------------------
ERROR is_terminal: True
✅ PASS: ERROR is a terminal status
STUCK is_terminal: True
✅ PASS: STUCK is a terminal status
FINISHED is_terminal: True
✅ PASS: FINISHED is a terminal status (but treated as hint in WS path)
What this proves: The fix preserves the fast termination path for ERROR and STUCK (which are not subject to stop-hook reverts), while FINISHED remains a terminal status but is handled differently in the WebSocket code path.
Test 3: REST polling confirmation pattern
Verification: Inspected _wait_for_run_completion source code to verify:
Result:
Test 3: REST polling confirmation pattern
----------------------------------------------------------------------
✅ PASS: TERMINAL_POLL_THRESHOLD is set to 3 in _wait_for_run_completion
✅ PASS: REST polling requires TERMINAL_POLL_THRESHOLD consecutive polls
This prevents premature termination when stop hooks revert FINISHED → RUNNING
✅ PASS: Counter resets when status flips back to non-terminal (e.g., FINISHED → RUNNING)
This correctly handles stop hook denials
What this proves:
- The client requires 3 consecutive
FINISHEDREST polls before terminating - If status flips back to
RUNNING(stop hook denial), the counter resets - This prevents the race where WS
FINISHEDarrives before the server's stop hook evaluation completes
Test 4: Regression test coverage
Examined the added regression test test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal in tests/sdk/conversation/remote/test_remote_conversation.py:
Test scenario:
- Seeds the WS terminal-status queue with
FINISHED(simulating the broadcast) - Configures REST polling to return
RUNNINGfor first 3 polls (simulating stop hook revert) - Then returns
FINISHEDfor subsequent polls (agent finishes again after hook allows it) - Verifies client doesn't return immediately but waits for REST confirmation (≥6 polls)
What this proves: The regression test directly reproduces the race condition scenario observed in ProgramBench retry-16 and verifies the fix prevents premature termination.
CI Verification
Command:
gh pr view 3191 --repo OpenHands/software-agent-sdk --json statusCheckRollup,stateCI Status Summary:
- ✅ sdk-tests: SUCCESS
- ✅ agent-server-tests: SUCCESS
- ✅ windows-tests: SUCCESS
- ✅ cross-tests: SUCCESS
- ✅ tools-tests: SUCCESS
- ✅ workspace-tests: SUCCESS
- ✅ Python API: SUCCESS
- ✅ REST API (OpenAPI): SUCCESS
- ✅ pre-commit: SUCCESS
Issues Found
None.
Summary: The fix correctly addresses the stop-hook race condition by treating WebSocket FINISHED as a wakeup hint rather than authoritative termination. REST polling with a 3-poll threshold provides reliable confirmation that the server has actually finished (and stop hooks have been evaluated). The implementation is clean, well-tested, and preserves the fast-path for ERROR/STUCK states.
Co-authored-by: openhands <openhands@all-hands.dev>
Summary
Fixes a RemoteConversation race where a WebSocket
FINISHEDevent could cause the remote client to stop waiting before server-side stop hooks finished evaluating whether the conversation should actually terminate.Before this change,
RemoteConversationtreatedFINISHEDfrom the WebSocket stream as authoritative. That is unsafe when stop hooks can veto completion and move the conversation back toRUNNING. The client could exit early, collect outputs, and proceed while the server-side conversation was still active.After this change:
ERRORandSTUCKremain immediate terminal states.FINISHEDis treated as a wake-up hint.Motivation
This was discovered during ProgramBench end-to-end CI. ProgramBench relies on stop hooks for compile-contract and reference-output validation. When a stop hook rejects completion, the agent must keep working; early client teardown can produce premature submissions.
Successful ProgramBench CI evidence that exercised this path:
Changes
openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.pyFINISHEDas advisory and confirms final state through REST polling.tests/sdk/conversation/remote/test_remote_conversation.pyFINISHED-vs-stop-hook race.AGENTS.mdVerification
uv run pytest tests/sdk/conversation/remote/test_remote_conversation.py -q32 passedStacking
This PR is split out from #3075 so the critical conversation-termination behavior change can receive focused human review. The ProgramBench workflow/docs PR should be stacked on top of this branch and contain only the benchmark-choice/doc changes.
This PR was created by an AI agent (OpenHands) on behalf of the user.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:bd2fd08-pythonRun
All tags pushed for this build
About Multi-Architecture Support
bd2fd08-python) is a multi-arch manifest supporting both amd64 and arm64bd2fd08-python-amd64) are also available if needed