fix(watchdog): process each stale run in its own session to avoid lock contention by tofarr · Pull Request #197 · OpenHands/automation

tofarr · 2026-06-18T20:17:31Z

Problem

Two slow queries were observed in production on the automation_runs table:

Query	Avg time	Calls	Avg rows returned
`UPDATE … SET status, error_detail, completed_at WHERE id=? AND status=?`	6.24s	92	0.261
`UPDATE … SET status, completed_at WHERE id=? AND status=?`	19.1s	10	0.5

Both filter by primary key (id), so the slowness is not an indexing problem — a PK lookup is nanoseconds. Multi-second execution on a single-row UPDATE is always lock contention.

Root cause

mark_stale_runs in watchdog.py opened one database session for the entire stale-run batch:

async with session_factory() as session:
    stale_runs = ...  # SELECT all stale runs

    for run in stale_runs:
        await _verify_and_mark_run(session, run, settings)
        # ↑ calls backend.verify_run() — slow external HTTP to sandbox
        # then executes UPDATE WHERE id=? AND status=RUNNING
        # → row lock acquired, NOT yet released

    if marked:
        await session.commit()   # ← locks released HERE, after the whole batch

Each UPDATE inside _verify_and_mark_run acquired a row-level exclusive lock that was held until the single session.commit() at the very end — across every subsequent backend.verify_run() call. When 10 stale runs were in a batch and each verification took ~2s, the first locked row was held for ~18s after its own UPDATE ran.

Callback endpoint UPDATEs (WHERE id=? AND status=RUNNING) on those same rows queued behind the watchdog's open transaction, producing the 6–19s waits. The < 1 avg rows returned confirms the optimistic lock was frequently lost: by the time the waiting UPDATE ran, the watchdog had already changed the status.

Fix

Fetch only the stale run IDs in a short-lived read session that closes immediately, then process each run in its own dedicated session, committing right after _verify_and_mark_run. Row locks are released as soon as each run is processed rather than held for the duration of the batch.

# 1. Short read session — closes before any per-run work
async with session_factory() as session:
    stale_run_ids = [...]

# 2. One session per run — lock released immediately after commit
for run_id in stale_run_ids:
    async with session_factory() as session:
        ...
        if await _verify_and_mark_run(session, run, settings):
            await session.commit()   # lock released here, not at end of batch

Testing

All 7 existing watchdog unit tests pass. The optimistic locking behaviour (AND status = RUNNING) is fully preserved.

This pull request was created by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>

…k contention The watchdog's mark_stale_runs held a single DB session open across the entire stale-run batch. Each UPDATE inside _verify_and_mark_run acquired a row-level exclusive lock that was held until the single session.commit() at the very end — across every subsequent backend.verify_run() call (external HTTP to sandboxes, potentially seconds each). This caused severe lock contention: callback endpoint UPDATEs on the same rows queued behind the watchdog batch, producing multi-second waits (up to 19s observed in production) and frequent optimistic-lock misses (avg rows returned < 1). Fix: fetch only the stale run IDs in a short-lived read session that closes immediately, then process each run in its own dedicated session and commit right after _verify_and_mark_run. Row locks are now released as soon as each run is done rather than held for the entire batch. Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-06-18T20:19:16Z

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-06-18T20:40:16Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

🟢 Good taste — Elegant, surgical fix to a real production problem.

The root cause is correctly identified (row-level exclusive locks held across all slow backend.verify_run() calls in a single batch session) and the fix is the right pattern: split the batch into a short-lived read session (ID-only) followed by one session per run, each committing immediately after its UPDATE. Lock window shrinks from the entire batch duration to milliseconds.

A few things worth calling out explicitly:

Correctness preserved. The optimistic locking predicate (AND status = RUNNING) in the UPDATE is untouched, so concurrent callbacks still win the race. The behavioral semantics are identical.
TOCTOU is handled. Between the ID read and the per-run re-fetch, a callback may have already changed a run's status. The re-fetch will load the run, _verify_and_mark_run will call backend.verify_run (wasteful but benign), and the UPDATE will see rowcount == 0 and return False. This is the same behaviour as before the fix.
Error handling is sound. If session.commit() throws, the exception is caught, the session auto-rolls back, and the loop continues — leaving the run in RUNNING state for the next watchdog cycle. Correct.
Comments add real value. The docstring paragraph and the inline comment explaining WHY the two-phase session split exists are precisely the kind of non-obvious invariants that belong in source code. No comment noise.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW
Single-file change to a periodic background worker. The fix reduces lock contention without changing any business logic, public API surface, or data model. Worst case on a regression is the watchdog processes stale runs slightly less efficiently — there is no data-loss or API-breaking risk.

VERDICT:
✅ Worth merging — Correct, minimal, well-explained. Solves a confirmed production problem.

KEY INSIGHT:
The two-phase read (IDs-only session closes first, then one session per run) is the textbook cure for long-held batch locks; the implementation here is clean and idiomatic.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

malhotra5

LGTM!

openhands-agent added 2 commits June 18, 2026 18:40

chore: bump version to 1.0.0a11, openhands-sdk/workspace to 1.29.0

5667745

Co-authored-by: openhands <openhands@all-hands.dev>

tofarr and others added 2 commits June 18, 2026 14:23

Merge branch 'main' into fix/watchdog-session-lock-contention

a56ba9a

style: fix ruff formatting in watchdog.py

976e2f1

Co-authored-by: openhands <openhands@all-hands.dev>

tofarr requested a review from all-hands-bot June 18, 2026 20:39

all-hands-bot approved these changes Jun 18, 2026

View reviewed changes

malhotra5 approved these changes Jun 18, 2026

View reviewed changes

tofarr merged commit f0d6923 into main Jun 18, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(watchdog): process each stale run in its own session to avoid lock contention#197

fix(watchdog): process each stale run in its own session to avoid lock contention#197
tofarr merged 4 commits into
mainfrom
fix/watchdog-session-lock-contention

tofarr commented Jun 18, 2026

Uh oh!

github-actions Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

all-hands-bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

malhotra5 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

tofarr commented Jun 18, 2026

Problem

Root cause

Fix

Testing

Uh oh!

github-actions Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

malhotra5 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions Bot commented Jun 18, 2026 •

edited

Loading

all-hands-bot commented Jun 18, 2026 •

edited

Loading