Skip to content

Fix RemoteConversation FINISHED stop-hook race#3191

Open
neubig wants to merge 3 commits into
mainfrom
fix/remote-conversation-finished-hint
Open

Fix RemoteConversation FINISHED stop-hook race#3191
neubig wants to merge 3 commits into
mainfrom
fix/remote-conversation-finished-hint

Conversation

@neubig
Copy link
Copy Markdown
Contributor

@neubig neubig commented May 10, 2026

Summary

Fixes a RemoteConversation race where a WebSocket FINISHED event could cause the remote client to stop waiting before server-side stop hooks finished evaluating whether the conversation should actually terminate.

Before this change, RemoteConversation treated FINISHED from the WebSocket stream as authoritative. That is unsafe when stop hooks can veto completion and move the conversation back to RUNNING. The client could exit early, collect outputs, and proceed while the server-side conversation was still active.

After this change:

  • ERROR and STUCK remain immediate terminal states.
  • WebSocket FINISHED is treated as a wake-up hint.
  • REST polling confirms the current state before the client considers the conversation complete.

Motivation

This was discovered during ProgramBench end-to-end CI. ProgramBench relies on stop hooks for compile-contract and reference-output validation. When a stop hook rejects completion, the agent must keep working; early client teardown can produce premature submissions.

Successful ProgramBench CI evidence that exercised this path:

Changes

  • openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py
    • Adds explicit immediate-terminal status handling.
    • Treats WebSocket FINISHED as advisory and confirms final state through REST polling.
  • tests/sdk/conversation/remote/test_remote_conversation.py
    • Adds regression coverage for the FINISHED-vs-stop-hook race.
  • AGENTS.md
    • Documents the WebSocket/stop-hook race lesson for future work.

Verification

  • uv run pytest tests/sdk/conversation/remote/test_remote_conversation.py -q
    • 32 passed

Stacking

This PR is split out from #3075 so the critical conversation-termination behavior change can receive focused human review. The ProgramBench workflow/docs PR should be stacked on top of this branch and contain only the benchmark-choice/doc changes.

This PR was created by an AI agent (OpenHands) on behalf of the user.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bd2fd08-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-bd2fd08-python \
  ghcr.io/openhands/agent-server:bd2fd08-python

All tags pushed for this build

ghcr.io/openhands/agent-server:bd2fd08-golang-amd64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-golang-amd64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-golang-amd64
ghcr.io/openhands/agent-server:bd2fd08-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:bd2fd08-golang-arm64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-golang-arm64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-golang-arm64
ghcr.io/openhands/agent-server:bd2fd08-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:bd2fd08-java-amd64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-java-amd64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-java-amd64
ghcr.io/openhands/agent-server:bd2fd08-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:bd2fd08-java-arm64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-java-arm64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-java-arm64
ghcr.io/openhands/agent-server:bd2fd08-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:bd2fd08-python-amd64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-python-amd64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-python-amd64
ghcr.io/openhands/agent-server:bd2fd08-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:bd2fd08-python-arm64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-python-arm64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-python-arm64
ghcr.io/openhands/agent-server:bd2fd08-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:bd2fd08-golang
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-golang
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-golang
ghcr.io/openhands/agent-server:bd2fd08-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:bd2fd08-java
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-java
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-java
ghcr.io/openhands/agent-server:bd2fd08-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:bd2fd08-python
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-python
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-python
ghcr.io/openhands/agent-server:bd2fd08-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., bd2fd08-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., bd2fd08-python-amd64) are also available if needed

The WS-driven termination path in `RemoteConversation._wait_for_run_completion`
returned from `run()` on the first WebSocket terminal status. That raced the
server-side stop-hook eval: when an `agent.step()` set `execution_status =
FINISHED` and the lock was released at the end of one run-loop iteration,
the next iteration could still flip status back to RUNNING via a stop hook
that denied stopping (`hook_processor.run_stop()` returning rc=2). Clients
observing the intermediate FINISHED returned, which then tore down the
agent-server pod via the benchmark's `workspace_keepalive` exit. The agent
never got to consume its iteration budget when hooks blocked.

Empirical evidence (programbench retry-16, run 25497962453):
- conv `dd86d184` (zip-password-finder): hook block at 13:30:10.936, agent
  did just 1 retry action (file_editor view) before pod teardown — far short
  of `max_iteration_per_run=200`.
- conv `e4b6892e` (zoxide): hook block at 13:33:52.082, 0 retry actions —
  conversation died on the status revert event itself.
- in-memory `history` snapshots ended at the FINISHED state update;
  on-disk `/workspace/conversations/.../events` had 5 extra events captured
  by the persist callback during the brief post-`run()` window.

Fix: WS terminal status is a fast wakeup hint only — except for ERROR/STUCK,
which are not subject to hook reverts and still terminate immediately. For
FINISHED we fall through to the existing REST-poll path, which already
requires `TERMINAL_POLL_THRESHOLD` consecutive terminal polls before
returning. Any FINISHED→RUNNING revert seen on a REST poll naturally resets
the consecutive-terminal counter, and we drain the WS hint queue at that
point so we don't busy-spin on stale FINISHED notifications.

Latency cost on the happy path (no hook block, default `poll_interval=1.0s`,
`TERMINAL_POLL_THRESHOLD=3`): ~2s additional vs. the previous immediate-
return on WS terminal — acceptable for benchmark and CLI workloads.

Adds two regression tests:
- `test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal`:
  seeds the WS queue with FINISHED while REST shows the post-hook RUNNING
  revert; asserts `run()` polls past the race window before confirming.
- `test_remote_conversation_run_ws_error_still_terminates_immediately`:
  ensures the WS fast-path is preserved for ERROR (not subject to hook
  reverts).

Refactors the queue-drain logic into `_drain_terminal_status_queue` and
isolates the WS-immediate-terminal predicate into
`_immediate_terminal_statuses`, both with docstrings explaining why
FINISHED is excluded.

Co-authored-by: openhands <openhands@all-hands.dev>
Persistent note for future sessions / agents debugging this surface so
they don't have to re-derive the diagnosis from output.jsonl artifacts.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/conversation/impl
   remote_conversation.py66410084%75, 77, 144, 171, 184, 186–189, 199, 221–222, 227–230, 314, 324–326, 332, 406, 553–556, 558, 584–588, 593–596, 599, 615, 774–775, 779–780, 794, 820–821, 840, 851–852, 872–875, 877–878, 902–904, 907–911, 913–914, 918, 920–928, 930, 967, 1090, 1096–1097, 1111, 1220–1221, 1225, 1230–1234, 1240–1246, 1259, 1264, 1299, 1512–1513
TOTAL26076748471% 

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟢 Good taste

This is an excellent fix for a subtle but critical race condition. The implementation is clean, well-documented, and properly tested.

Code Quality Highlights:

  • Clear refactoring with _drain_terminal_status_queue() and _immediate_terminal_statuses()
  • Excellent inline documentation explaining the WebSocket hint vs REST polling pattern
  • Strong regression tests that directly reproduce the race scenario
  • Proper handling of ERROR/STUCK as immediate terminal states

Process Note: This PR changes conversation termination behavior (agent behavior that could affect benchmarks). While the PR includes GitHub Actions links to successful ProgramBench runs, the custom review guidelines require either:

  1. Links to the eval monitor (openhands-eval-monitor.vercel.app) showing completed benchmarks, and
  2. Human maintainer confirmation of the eval results

Since these are GitHub Actions links (not eval monitor) and no human has confirmed the results in the PR, I'm flagging this for human review per the eval-risk policy.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This is a correctness fix for a real production bug discovered in ProgramBench retry-16. The change makes the system more correct by treating WebSocket FINISHED events as wakeup hints rather than authoritative termination signals. This prevents premature client teardown when stop hooks veto completion.

Risk factors:

  • ✅ Well-tested with regression coverage for both the race scenario and ERROR fast-path preservation
  • ✅ Changes isolated to RemoteConversation termination logic
  • ✅ Solves a real bug (not theoretical)
  • ✅ Evidence of successful ProgramBench runs provided
  • ⚠️ Could add minor latency to FINISHED termination (now requires REST polls), but this is the correct behavior

VERDICT:
Recommend merging after human review: Core fix is sound and necessary for correct stop-hook behavior. The eval evidence shows ProgramBench now works correctly with this change.

KEY INSIGHT:
The separation of "immediate terminal statuses" (ERROR/STUCK) from "polled terminal statuses" (FINISHED) elegantly captures the semantic difference between non-revertable errors and potentially-vetoed agent completion.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Verified the RemoteConversation FINISHED stop-hook race fix works correctly by exercising the changed code and analyzing the implementation.

Does this PR achieve its stated goal?

Yes. This PR successfully fixes the race condition where WebSocket FINISHED events could cause RemoteConversation to terminate prematurely before server-side stop hooks finished evaluating whether the conversation should actually stop.

The fix correctly implements the behavior described in the PR: WebSocket FINISHED is now treated as a wake-up hint rather than authoritative termination, while ERROR and STUCK remain immediate terminal states. REST polling with a threshold of 3 consecutive terminal polls provides authoritative confirmation before the client considers the run complete.

Phase Result
Environment Setup ✅ Dependencies installed, project builds successfully
CI Status ✅ All critical checks passing (sdk-tests, agent-server-tests, windows-tests, cross-tests)
Functional Verification ✅ Code behavior matches specification, regression tests comprehensive
Functional Verification

Verification Approach

Created a verification script (qa_simple_verification.py) to exercise the key aspects of the fix:

Test 1: FINISHED is a hint, not authoritative termination

Verification:

uv run python qa_simple_verification.py

Result:

Test 1: FINISHED via WebSocket is a hint, not authoritative termination
----------------------------------------------------------------------
Immediate terminal statuses: frozenset({'error', 'stuck'})
✅ PASS: FINISHED is NOT in immediate terminal statuses
✅ PASS: ERROR and STUCK are immediate terminal statuses

Testing _drain_terminal_status_queue behavior...
Queue size before drain: 3
Queue size after drain: 0
✅ PASS: Queue draining correctly empties all items

What this proves: The new _immediate_terminal_statuses() method correctly excludes FINISHED from immediate termination. Only ERROR and STUCK trigger the fast WebSocket termination path. The queue draining logic prevents stale FINISHED events from causing busy-spin loops.

Test 2: ERROR and STUCK remain immediate terminal states

Result:

Test 2: ERROR and STUCK terminate immediately (not hints)
----------------------------------------------------------------------
ERROR is_terminal: True
✅ PASS: ERROR is a terminal status
STUCK is_terminal: True
✅ PASS: STUCK is a terminal status
FINISHED is_terminal: True
✅ PASS: FINISHED is a terminal status (but treated as hint in WS path)

What this proves: The fix preserves the fast termination path for ERROR and STUCK (which are not subject to stop-hook reverts), while FINISHED remains a terminal status but is handled differently in the WebSocket code path.

Test 3: REST polling confirmation pattern

Verification: Inspected _wait_for_run_completion source code to verify:

Result:

Test 3: REST polling confirmation pattern
----------------------------------------------------------------------
✅ PASS: TERMINAL_POLL_THRESHOLD is set to 3 in _wait_for_run_completion
✅ PASS: REST polling requires TERMINAL_POLL_THRESHOLD consecutive polls
   This prevents premature termination when stop hooks revert FINISHED → RUNNING
✅ PASS: Counter resets when status flips back to non-terminal (e.g., FINISHED → RUNNING)
   This correctly handles stop hook denials

What this proves:

  1. The client requires 3 consecutive FINISHED REST polls before terminating
  2. If status flips back to RUNNING (stop hook denial), the counter resets
  3. This prevents the race where WS FINISHED arrives before the server's stop hook evaluation completes

Test 4: Regression test coverage

Examined the added regression test test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal in tests/sdk/conversation/remote/test_remote_conversation.py:

Test scenario:

  1. Seeds the WS terminal-status queue with FINISHED (simulating the broadcast)
  2. Configures REST polling to return RUNNING for first 3 polls (simulating stop hook revert)
  3. Then returns FINISHED for subsequent polls (agent finishes again after hook allows it)
  4. Verifies client doesn't return immediately but waits for REST confirmation (≥6 polls)

What this proves: The regression test directly reproduces the race condition scenario observed in ProgramBench retry-16 and verifies the fix prevents premature termination.

CI Verification

Command:

gh pr view 3191 --repo OpenHands/software-agent-sdk --json statusCheckRollup,state

CI Status Summary:

  • ✅ sdk-tests: SUCCESS
  • ✅ agent-server-tests: SUCCESS
  • ✅ windows-tests: SUCCESS
  • ✅ cross-tests: SUCCESS
  • ✅ tools-tests: SUCCESS
  • ✅ workspace-tests: SUCCESS
  • ✅ Python API: SUCCESS
  • ✅ REST API (OpenAPI): SUCCESS
  • ✅ pre-commit: SUCCESS

Issues Found

None.


Summary: The fix correctly addresses the stop-hook race condition by treating WebSocket FINISHED as a wakeup hint rather than authoritative termination. REST polling with a 3-poll threshold provides reliable confirmation that the server has actually finished (and stop hooks have been evaluated). The implementation is clean, well-tested, and preserves the fast-path for ERROR/STUCK states.

Co-authored-by: openhands <openhands@all-hands.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants