swebenchmultimodal: diegomura/react-pdf-1178 consistently fails runtime init across models and runs #645

@simonrosenberg

Summary

The diegomura__react-pdf-1178 instance in swebenchmultimodal-dev consistently fails
runtime init across multiple models and runs, even after the runtime-api / sdk fix for
#523 (OpenHands/software-agent-sdk#2656). This is deterministic, not transient: every
attempt in every run fails with the same 502→503 runtime init pattern.

Two sibling instances (react-pdf-1280, react-pdf-1552) also fail frequently but
intermittently — they sometimes succeed on retry.

Evidence

Run 1: 2026-04-06, opus4.6 + sonnet4.5 (evaluation run 24040981135)

| Model | Job | react-pdf-1178 | react-pdf-1280 | react-pdf-1552 |
|---|---|---|---|---|
| claude-4.6-opus | eval-24040981135-claude-4-6 | ❌ failed 4x | ✅ (on retry) | ✅ (on retry) |
| claude-sonnet-4-5-20250929 | eval-24040981135-claude-son | ❌ failed 4x | ❌ failed 4x | ❌ failed 4x |

Run 2: 2026-04-04, gemini-3.1 (jobs eval-23983344719-gemini-3-1, eval-23968391657-gemini-3-1)

react-pdf-1178 failed on both runs (Datadog logs confirm failed after 4 attempts).

So: across 3 distinct models and at least 3 eval runs over 3 days, react-pdf-1178 failed
every time.

Error pattern

From kube_job:eval-24040981135-claude-4-6 (Opus), instance diegomura__react-pdf-1178:

2026-04-06 19:43:17 ERROR  HTTP request failed (502 Bad Gateway): ...
  url: https://vqjiqojpacbxmmaw.eval-runtime.all-hands.dev/api/acp/conversations/9e52265f-...
2026-04-06 19:43:19 ERROR  HTTP request failed (503 Service Unavailable): ...
... (20+ seconds of 1/s polling with 503s) ...
2026-04-06 19:43:37 WARNING [worker] runtime init failure instance=diegomura__react-pdf-1178
    attempt=1 retry=1 runtime_id=vqjiqojpacbxmmaw
    error=Conversation run failed ...: Remote conversation ended with error
... (3 more attempts with different runtime_ids, same pattern) ...
2026-04-06 21:59:16 ERROR  [worker] Instance diegomura__react-pdf-1178 failed after 4 attempts.

Each attempt spins up a fresh runtime pod (new runtime_id), hits 502/503 on
/api/acp/conversations/..., polls for 20+ seconds, gives up, and moves on. After 4 such
attempts the worker marks the instance failed.
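
To compare instances across jobs, the worker warnings can be tallied from exported logs. The following is a minimal sketch that assumes the field layout of the `runtime init failure` line in the excerpt above; `tally_init_failures` is a hypothetical helper, not part of the SDK, and real log lines may carry extra fields.

```python
import re

# Field layout copied from the worker warning in the excerpt above.
# \s+ also matches newlines, so a warning wrapped onto a second line still parses.
INIT_FAILURE = re.compile(
    r"runtime init failure\s+instance=(?P<instance>\S+)\s+"
    r"attempt=(?P<attempt>\d+)\s+retry=(?P<retry>\d+)\s+"
    r"runtime_id=(?P<runtime_id>\S+)"
)


def tally_init_failures(log_text: str) -> dict[str, list[str]]:
    """Map each instance to the runtime_ids of its failed init attempts."""
    failures: dict[str, list[str]] = {}
    for match in INIT_FAILURE.finditer(log_text):
        failures.setdefault(match["instance"], []).append(match["runtime_id"])
    return failures
```

An instance that lists 4 distinct runtime_ids in a single job matches the "failed after 4 attempts" pattern described here.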

Suspected root cause

Likely the same family as #352 (still OPEN): the react-pdf-1178 image is slow to boot
(or has a startup issue), the agent-server startup probe on port 60000 fails, the pod is
force-stopped, and the client sees 502/503 on /api/acp/conversations/....

Unlike #523 (which was about large wp-calypso/p5.js images and was fixed by sdk#2656),
this one is narrower — a single react-pdf instance that consistently fails. Possibilities:

  1. Broken image: react-pdf-1178 Dockerfile produces an image that never becomes
    healthy (e.g. node version regression — see root_cause_acp_node.md in memory; Node
    v12–v14 crashes claude-agent-acp).
  2. Unreasonable startup time: image starts but takes longer than the startup probe
    window.
  3. ACP-specific init bug: the instance repo requires setup that ACP agents trip on.
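
Possibility 1 can be checked mechanically once the image's `node --version` output is in hand. This is a rough sketch: the helper names are hypothetical, and the v12 to v14 range is taken only from the note above, so other versions are not classified either way.

```python
import re

# Per the note above, Node v12 through v14 crash claude-agent-acp.
# Nothing is claimed here about versions outside that range.
KNOWN_BAD_MAJORS = range(12, 15)


def node_major(version_output: str) -> int:
    """Extract the major version from `node --version` output such as 'v14.21.3'."""
    match = re.match(r"\s*v?(\d+)\.", version_output)
    if match is None:
        raise ValueError(f"unrecognized node version string: {version_output!r}")
    return int(match.group(1))


def is_known_bad_node(version_output: str) -> bool:
    """True if the version falls in the range reported to crash claude-agent-acp."""
    return node_major(version_output) in KNOWN_BAD_MAJORS
```

How the version string is obtained from the image (for example, by running `node --version` in a container started from it) is left open here.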

Repro

/trigger-benchmarks swebenchmultimodal agent_type=acp-claude model_ids=claude-4.6-opus \
  eval_limit=500 reason="repro react-pdf-1178 failure"

Expected: react-pdf-1178 fails after 4 runtime init attempts.

Proposed next steps

  1. Inspect the SWE-bench-M image for diegomura/react-pdf-1178 — manually pull it,
    start agent-server in it, verify /server_info on port 60000 comes up.
  2. Check if Node.js is old in that image (see known ACP/Node issue); force Node 22
    install if missing.
  3. If the image is fundamentally broken, consider blacklisting the 3 affected
    instances (1178, 1280, 1552) or filing upstream with SWE-bench-M.
  4. Add a fast-fail path in the runtime init polling: if 5xx persists for >5s on the
    first attempt, abort that attempt instead of polling for 20s (saves ~60s per failed
    instance — currently ~90s of wall-clock wasted on retries).
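
The fast-fail path in step 4 could look roughly like the sketch below. It is an illustration under assumptions, not the SDK's actual polling code: `wait_for_runtime`, `get_status`, and `RuntimeInitFastFail` are hypothetical names, `get_status` is assumed to return an HTTP status code (or `None` when no response arrived), and any non-5xx status is treated as "runtime reachable" for simplicity.

```python
import time
from typing import Callable, Optional


class RuntimeInitFastFail(Exception):
    """Raised when 5xx responses persist past the fast-fail window."""


def wait_for_runtime(
    get_status: Callable[[], Optional[int]],
    *,
    fast_fail_after: float = 5.0,   # abort once 5xx has persisted this long
    give_up_after: float = 20.0,    # overall polling budget per attempt
    interval: float = 1.0,          # 1/s polling, as seen in the logs
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until a non-5xx status arrives; abort early on sustained 5xx."""
    start = clock()
    streak_start: Optional[float] = None  # start of the current unbroken 5xx run
    while True:
        status = get_status()
        now = clock()
        if status is not None and status < 500:
            return status
        if status is not None:
            # 5xx response: start or extend the streak, then check the window.
            if streak_start is None:
                streak_start = now
            if now - streak_start > fast_fail_after:
                raise RuntimeInitFastFail(f"{status} persisted for more than {fast_fail_after}s")
        else:
            # No response at all: not a 5xx, so the streak resets.
            streak_start = None
        if now - start > give_up_after:
            raise TimeoutError(f"runtime not reachable after {give_up_after}s")
        sleep(interval)
```

The `clock` and `sleep` parameters exist only so the timing logic can be exercised without real waiting. Whether the short window should apply to every attempt or only the first, as step 4 words it, is a policy choice the caller would make by passing a different `fast_fail_after` per attempt.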
