Fix RemoteConversation FINISHED stop-hook race by neubig · Pull Request #3191 · OpenHands/software-agent-sdk

neubig · 2026-05-10T15:56:57Z

Summary

Fixes a RemoteConversation race where a WebSocket FINISHED event could cause the remote client to stop waiting before server-side stop hooks finished evaluating whether the conversation should actually terminate.

Before this change, RemoteConversation treated FINISHED from the WebSocket stream as authoritative. That is unsafe when stop hooks can veto completion and move the conversation back to RUNNING. The client could exit early, collect outputs, and proceed while the server-side conversation was still active.

After this change:

ERROR and STUCK remain immediate terminal states.
WebSocket FINISHED is treated as a wake-up hint.
REST polling confirms the current state before the client considers the conversation complete.

Motivation

This was discovered during ProgramBench end-to-end CI. ProgramBench relies on stop hooks for compile-contract and reference-output validation. When a stop hook rejects completion, the agent must keep working; early client teardown can produce premature submissions.

Successful ProgramBench CI evidence that exercised this path:

Parent eval job: https://github.com/OpenHands/evaluation/actions/runs/25552125670 — success.
Downstream inference phase: https://github.com/OpenHands/evaluation/actions/runs/25552168816 — success.
Downstream eval harness phase: https://github.com/OpenHands/evaluation/actions/runs/25552772174 — success.

Changes

openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py
- Adds explicit immediate-terminal status handling.
- Treats WebSocket FINISHED as advisory and confirms final state through REST polling.
tests/sdk/conversation/remote/test_remote_conversation.py
- Adds regression coverage for the FINISHED-vs-stop-hook race.
AGENTS.md
- Documents the WebSocket/stop-hook race lesson for future work.

Verification

uv run pytest tests/sdk/conversation/remote/test_remote_conversation.py -q
- 32 passed

Stacking

This PR is split out from #3075 so the critical conversation-termination behavior change can receive focused human review. The ProgramBench workflow/docs PR should be stacked on top of this branch and contain only the benchmark-choice/doc changes.

This PR was created by an AI agent (OpenHands) on behalf of the user.

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bd2fd08-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-bd2fd08-python \
  ghcr.io/openhands/agent-server:bd2fd08-python

All tags pushed for this build

ghcr.io/openhands/agent-server:bd2fd08-golang-amd64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-golang-amd64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-golang-amd64
ghcr.io/openhands/agent-server:bd2fd08-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:bd2fd08-golang-arm64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-golang-arm64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-golang-arm64
ghcr.io/openhands/agent-server:bd2fd08-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:bd2fd08-java-amd64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-java-amd64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-java-amd64
ghcr.io/openhands/agent-server:bd2fd08-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:bd2fd08-java-arm64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-java-arm64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-java-arm64
ghcr.io/openhands/agent-server:bd2fd08-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:bd2fd08-python-amd64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-python-amd64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-python-amd64
ghcr.io/openhands/agent-server:bd2fd08-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:bd2fd08-python-arm64
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-python-arm64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-python-arm64
ghcr.io/openhands/agent-server:bd2fd08-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:bd2fd08-golang
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-golang
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-golang
ghcr.io/openhands/agent-server:bd2fd08-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:bd2fd08-java
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-java
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-java
ghcr.io/openhands/agent-server:bd2fd08-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:bd2fd08-python
ghcr.io/openhands/agent-server:bd2fd081e8c113f922e9d34869439b3c4bca628c-python
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-python
ghcr.io/openhands/agent-server:bd2fd08-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., bd2fd08-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., bd2fd08-python-amd64) are also available if needed

The WS-driven termination path in `RemoteConversation._wait_for_run_completion` returned from `run()` on the first WebSocket terminal status. That raced the server-side stop-hook eval: when an `agent.step()` set `execution_status = FINISHED` and the lock was released at the end of one run-loop iteration, the next iteration could still flip status back to RUNNING via a stop hook that denied stopping (`hook_processor.run_stop()` returning rc=2). Clients observing the intermediate FINISHED returned, which then tore down the agent-server pod via the benchmark's `workspace_keepalive` exit. The agent never got to consume its iteration budget when hooks blocked. Empirical evidence (programbench retry-16, run 25497962453): - conv `dd86d184` (zip-password-finder): hook block at 13:30:10.936, agent did just 1 retry action (file_editor view) before pod teardown — far short of `max_iteration_per_run=200`. - conv `e4b6892e` (zoxide): hook block at 13:33:52.082, 0 retry actions — conversation died on the status revert event itself. - in-memory `history` snapshots ended at the FINISHED state update; on-disk `/workspace/conversations/.../events` had 5 extra events captured by the persist callback during the brief post-`run()` window. Fix: WS terminal status is a fast wakeup hint only — except for ERROR/STUCK, which are not subject to hook reverts and still terminate immediately. For FINISHED we fall through to the existing REST-poll path, which already requires `TERMINAL_POLL_THRESHOLD` consecutive terminal polls before returning. Any FINISHED→RUNNING revert seen on a REST poll naturally resets the consecutive-terminal counter, and we drain the WS hint queue at that point so we don't busy-spin on stale FINISHED notifications. Latency cost on the happy path (no hook block, default `poll_interval=1.0s`, `TERMINAL_POLL_THRESHOLD=3`): ~2s additional vs. the previous immediate- return on WS terminal — acceptable for benchmark and CLI workloads. Adds two regression tests: - `test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal`: seeds the WS queue with FINISHED while REST shows the post-hook RUNNING revert; asserts `run()` polls past the race window before confirming. - `test_remote_conversation_run_ws_error_still_terminates_immediately`: ensures the WS fast-path is preserved for ERROR (not subject to hook reverts). Refactors the queue-drain logic into `_drain_terminal_status_queue` and isolates the WS-immediate-terminal predicate into `_immediate_terminal_statuses`, both with docstrings explaining why FINISHED is excluded. Co-authored-by: openhands <openhands@all-hands.dev>

Persistent note for future sessions / agents debugging this surface so they don't have to re-derive the diagnosis from output.jsonl artifacts. Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-05-10T15:57:29Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-10T15:57:34Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-10T15:59:16Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/conversation/impl
remote_conversation.py	664	100	84%	75, 77, 144, 171, 184, 186–189, 199, 221–222, 227–230, 314, 324–326, 332, 406, 553–556, 558, 584–588, 593–596, 599, 615, 774–775, 779–780, 794, 820–821, 840, 851–852, 872–875, 877–878, 902–904, 907–911, 913–914, 918, 920–928, 930, 967, 1090, 1096–1097, 1111, 1220–1221, 1225, 1230–1234, 1240–1246, 1259, 1264, 1299, 1512–1513
TOTAL	26076	7484	71%

all-hands-bot

Taste Rating: 🟢 Good taste

This is an excellent fix for a subtle but critical race condition. The implementation is clean, well-documented, and properly tested.

Code Quality Highlights:

Clear refactoring with _drain_terminal_status_queue() and _immediate_terminal_statuses()
Excellent inline documentation explaining the WebSocket hint vs REST polling pattern
Strong regression tests that directly reproduce the race scenario
Proper handling of ERROR/STUCK as immediate terminal states

Process Note: This PR changes conversation termination behavior (agent behavior that could affect benchmarks). While the PR includes GitHub Actions links to successful ProgramBench runs, the custom review guidelines require either:

Links to the eval monitor (openhands-eval-monitor.vercel.app) showing completed benchmarks, and
Human maintainer confirmation of the eval results

Since these are GitHub Actions links (not eval monitor) and no human has confirmed the results in the PR, I'm flagging this for human review per the eval-risk policy.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This is a correctness fix for a real production bug discovered in ProgramBench retry-16. The change makes the system more correct by treating WebSocket FINISHED events as wakeup hints rather than authoritative termination signals. This prevents premature client teardown when stop hooks veto completion.

Risk factors:

✅ Well-tested with regression coverage for both the race scenario and ERROR fast-path preservation
✅ Changes isolated to RemoteConversation termination logic
✅ Solves a real bug (not theoretical)
✅ Evidence of successful ProgramBench runs provided
⚠️ Could add minor latency to FINISHED termination (now requires REST polls), but this is the correct behavior

VERDICT:
✅ Recommend merging after human review: Core fix is sound and necessary for correct stop-hook behavior. The eval evidence shows ProgramBench now works correctly with this change.

KEY INSIGHT:
The separation of "immediate terminal statuses" (ERROR/STUCK) from "polled terminal statuses" (FINISHED) elegantly captures the semantic difference between non-revertable errors and potentially-vetoed agent completion.

all-hands-bot

✅ QA Report: PASS

Verified the RemoteConversation FINISHED stop-hook race fix works correctly by exercising the changed code and analyzing the implementation.

Does this PR achieve its stated goal?

Yes. This PR successfully fixes the race condition where WebSocket FINISHED events could cause RemoteConversation to terminate prematurely before server-side stop hooks finished evaluating whether the conversation should actually stop.

The fix correctly implements the behavior described in the PR: WebSocket FINISHED is now treated as a wake-up hint rather than authoritative termination, while ERROR and STUCK remain immediate terminal states. REST polling with a threshold of 3 consecutive terminal polls provides authoritative confirmation before the client considers the run complete.

Phase	Result
Environment Setup	✅ Dependencies installed, project builds successfully
CI Status	✅ All critical checks passing (sdk-tests, agent-server-tests, windows-tests, cross-tests)
Functional Verification	✅ Code behavior matches specification, regression tests comprehensive

Functional Verification

Verification Approach

Created a verification script (qa_simple_verification.py) to exercise the key aspects of the fix:

Test 1: FINISHED is a hint, not authoritative termination

Verification:

uv run python qa_simple_verification.py

Result:

Test 1: FINISHED via WebSocket is a hint, not authoritative termination
----------------------------------------------------------------------
Immediate terminal statuses: frozenset({'error', 'stuck'})
✅ PASS: FINISHED is NOT in immediate terminal statuses
✅ PASS: ERROR and STUCK are immediate terminal statuses

Testing _drain_terminal_status_queue behavior...
Queue size before drain: 3
Queue size after drain: 0
✅ PASS: Queue draining correctly empties all items

What this proves: The new _immediate_terminal_statuses() method correctly excludes FINISHED from immediate termination. Only ERROR and STUCK trigger the fast WebSocket termination path. The queue draining logic prevents stale FINISHED events from causing busy-spin loops.

Test 2: ERROR and STUCK remain immediate terminal states

Result:

Test 2: ERROR and STUCK terminate immediately (not hints)
----------------------------------------------------------------------
ERROR is_terminal: True
✅ PASS: ERROR is a terminal status
STUCK is_terminal: True
✅ PASS: STUCK is a terminal status
FINISHED is_terminal: True
✅ PASS: FINISHED is a terminal status (but treated as hint in WS path)

What this proves: The fix preserves the fast termination path for ERROR and STUCK (which are not subject to stop-hook reverts), while FINISHED remains a terminal status but is handled differently in the WebSocket code path.

Test 3: REST polling confirmation pattern

Verification: Inspected _wait_for_run_completion source code to verify:

Result:

Test 3: REST polling confirmation pattern
----------------------------------------------------------------------
✅ PASS: TERMINAL_POLL_THRESHOLD is set to 3 in _wait_for_run_completion
✅ PASS: REST polling requires TERMINAL_POLL_THRESHOLD consecutive polls
   This prevents premature termination when stop hooks revert FINISHED → RUNNING
✅ PASS: Counter resets when status flips back to non-terminal (e.g., FINISHED → RUNNING)
   This correctly handles stop hook denials

What this proves:

The client requires 3 consecutive FINISHED REST polls before terminating
If status flips back to RUNNING (stop hook denial), the counter resets
This prevents the race where WS FINISHED arrives before the server's stop hook evaluation completes

Test 4: Regression test coverage

Examined the added regression test test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal in tests/sdk/conversation/remote/test_remote_conversation.py:

Test scenario:

Seeds the WS terminal-status queue with FINISHED (simulating the broadcast)
Configures REST polling to return RUNNING for first 3 polls (simulating stop hook revert)
Then returns FINISHED for subsequent polls (agent finishes again after hook allows it)
Verifies client doesn't return immediately but waits for REST confirmation (≥6 polls)

What this proves: The regression test directly reproduces the race condition scenario observed in ProgramBench retry-16 and verifies the fix prevents premature termination.

CI Verification

Command:

gh pr view 3191 --repo OpenHands/software-agent-sdk --json statusCheckRollup,state

CI Status Summary:

✅ sdk-tests: SUCCESS
✅ agent-server-tests: SUCCESS
✅ windows-tests: SUCCESS
✅ cross-tests: SUCCESS
✅ tools-tests: SUCCESS
✅ workspace-tests: SUCCESS
✅ Python API: SUCCESS
✅ REST API (OpenAPI): SUCCESS
✅ pre-commit: SUCCESS

Issues Found

None.

Summary: The fix correctly addresses the stop-hook race condition by treating WebSocket FINISHED as a wakeup hint rather than authoritative termination. REST polling with a 3-poll threshold provides reliable confirmation that the server has actually finished (and stop hooks have been evaluated). The implementation is clean, well-tested, and preserves the fast-path for ERROR/STUCK states.

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-agent added 2 commits May 10, 2026 15:55

docs(AGENTS): document RemoteConversation WS-vs-stop-hook race

bd2fd08

Persistent note for future sessions / agents debugging this surface so they don't have to re-derive the diagnosis from output.jsonl artifacts. Co-authored-by: openhands <openhands@all-hands.dev>

neubig mentioned this pull request May 10, 2026

Add programbench to run-eval benchmark choices #3075

Merged

neubig requested a review from all-hands-bot May 10, 2026 15:58

all-hands-bot reviewed May 10, 2026

View reviewed changes

Add programbench to run-eval benchmark choices (#3075)

1aa4176

Co-authored-by: openhands <openhands@all-hands.dev>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix RemoteConversation FINISHED stop-hook race#3191

Fix RemoteConversation FINISHED stop-hook race#3191
neubig wants to merge 3 commits into
mainfrom
fix/remote-conversation-finished-hint

neubig commented May 10, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

neubig commented May 10, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Changes

Verification

Stacking

Uh oh!

github-actions Bot commented May 10, 2026

Python API breakage checks — ✅ PASSED

Uh oh!

github-actions Bot commented May 10, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

✅ QA Report: PASS

Does this PR achieve its stated goal?

Verification Approach

Test 1: FINISHED is a hint, not authoritative termination

Test 2: ERROR and STUCK remain immediate terminal states

Test 3: REST polling confirmation pattern

Test 4: Regression test coverage

CI Verification

Issues Found

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

neubig commented May 10, 2026 •

edited by github-actions Bot

Loading