Skip to content

fix(watchdog): process each stale run in its own session to avoid lock contention#197

Merged
tofarr merged 4 commits into
mainfrom
fix/watchdog-session-lock-contention
Jun 18, 2026
Merged

fix(watchdog): process each stale run in its own session to avoid lock contention#197
tofarr merged 4 commits into
mainfrom
fix/watchdog-session-lock-contention

Conversation

@tofarr

@tofarr tofarr commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Problem

Two slow queries were observed in production on the automation_runs table:

Query Avg time Calls Avg rows returned
UPDATE … SET status, error_detail, completed_at WHERE id=? AND status=? 6.24s 92 0.261
UPDATE … SET status, completed_at WHERE id=? AND status=? 19.1s 10 0.5

Both filter by primary key (id), so the slowness is not an indexing problem — a PK lookup is nanoseconds. Multi-second execution on a single-row UPDATE is always lock contention.

Root cause

mark_stale_runs in watchdog.py opened one database session for the entire stale-run batch:

async with session_factory() as session:
    stale_runs = ...  # SELECT all stale runs

    for run in stale_runs:
        await _verify_and_mark_run(session, run, settings)
        # ↑ calls backend.verify_run() — slow external HTTP to sandbox
        # then executes UPDATE WHERE id=? AND status=RUNNING
        # → row lock acquired, NOT yet released

    if marked:
        await session.commit()   # ← locks released HERE, after the whole batch

Each UPDATE inside _verify_and_mark_run acquired a row-level exclusive lock that was held until the single session.commit() at the very end — across every subsequent backend.verify_run() call. When 10 stale runs were in a batch and each verification took ~2s, the first locked row was held for ~18s after its own UPDATE ran.

Callback endpoint UPDATEs (WHERE id=? AND status=RUNNING) on those same rows queued behind the watchdog's open transaction, producing the 6–19s waits. The < 1 avg rows returned confirms the optimistic lock was frequently lost: by the time the waiting UPDATE ran, the watchdog had already changed the status.

Fix

Fetch only the stale run IDs in a short-lived read session that closes immediately, then process each run in its own dedicated session, committing right after _verify_and_mark_run. Row locks are released as soon as each run is processed rather than held for the duration of the batch.

# 1. Short read session — closes before any per-run work
async with session_factory() as session:
    stale_run_ids = [...]

# 2. One session per run — lock released immediately after commit
for run_id in stale_run_ids:
    async with session_factory() as session:
        ...
        if await _verify_and_mark_run(session, run, settings):
            await session.commit()   # lock released here, not at end of batch

Testing

All 7 existing watchdog unit tests pass. The optimistic locking behaviour (AND status = RUNNING) is fully preserved.


This pull request was created by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>
…k contention

The watchdog's mark_stale_runs held a single DB session open across the
entire stale-run batch. Each UPDATE inside _verify_and_mark_run acquired
a row-level exclusive lock that was held until the single session.commit()
at the very end — across every subsequent backend.verify_run() call
(external HTTP to sandboxes, potentially seconds each).

This caused severe lock contention: callback endpoint UPDATEs on the same
rows queued behind the watchdog batch, producing multi-second waits (up to
19s observed in production) and frequent optimistic-lock misses (avg rows
returned < 1).

Fix: fetch only the stale run IDs in a short-lived read session that closes
immediately, then process each run in its own dedicated session and commit
right after _verify_and_mark_run. Row locks are now released as soon as
each run is done rather than held for the entire batch.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown

Coverage

@tofarr tofarr requested a review from all-hands-bot June 18, 2026 20:39

all-hands-bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Good taste — Elegant, surgical fix to a real production problem.

The root cause is correctly identified (row-level exclusive locks held across all slow backend.verify_run() calls in a single batch session) and the fix is the right pattern: split the batch into a short-lived read session (ID-only) followed by one session per run, each committing immediately after its UPDATE. Lock window shrinks from the entire batch duration to milliseconds.

A few things worth calling out explicitly:

  • Correctness preserved. The optimistic locking predicate (AND status = RUNNING) in the UPDATE is untouched, so concurrent callbacks still win the race. The behavioral semantics are identical.
  • TOCTOU is handled. Between the ID read and the per-run re-fetch, a callback may have already changed a run's status. The re-fetch will load the run, _verify_and_mark_run will call backend.verify_run (wasteful but benign), and the UPDATE will see rowcount == 0 and return False. This is the same behaviour as before the fix.
  • Error handling is sound. If session.commit() throws, the exception is caught, the session auto-rolls back, and the loop continues — leaving the run in RUNNING state for the next watchdog cycle. Correct.
  • Comments add real value. The docstring paragraph and the inline comment explaining WHY the two-phase session split exists are precisely the kind of non-obvious invariants that belong in source code. No comment noise.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW
    Single-file change to a periodic background worker. The fix reduces lock contention without changing any business logic, public API surface, or data model. Worst case on a regression is the watchdog processes stale runs slightly less efficiently — there is no data-loss or API-breaking risk.

VERDICT:
Worth merging — Correct, minimal, well-explained. Solves a confirmed production problem.

KEY INSIGHT:
The two-phase read (IDs-only session closes first, then one session per run) is the textbook cure for long-held batch locks; the implementation here is clean and idiomatic.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

@malhotra5 malhotra5 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tofarr tofarr merged commit f0d6923 into main Jun 18, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants