fix(watchdog): process each stale run in its own session to avoid lock contention#197
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
…k contention The watchdog's mark_stale_runs held a single DB session open across the entire stale-run batch. Each UPDATE inside _verify_and_mark_run acquired a row-level exclusive lock that was held until the single session.commit() at the very end — across every subsequent backend.verify_run() call (external HTTP to sandboxes, potentially seconds each). This caused severe lock contention: callback endpoint UPDATEs on the same rows queued behind the watchdog batch, producing multi-second waits (up to 19s observed in production) and frequent optimistic-lock misses (avg rows returned < 1). Fix: fetch only the stale run IDs in a short-lived read session that closes immediately, then process each run in its own dedicated session and commit right after _verify_and_mark_run. Row locks are now released as soon as each run is done rather than held for the entire batch. Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
🟢 Good taste — Elegant, surgical fix to a real production problem.
The root cause is correctly identified (row-level exclusive locks held across all slow backend.verify_run() calls in a single batch session) and the fix is the right pattern: split the batch into a short-lived read session (ID-only) followed by one session per run, each committing immediately after its UPDATE. Lock window shrinks from the entire batch duration to milliseconds.
A few things worth calling out explicitly:
- Correctness preserved. The optimistic locking predicate (
AND status = RUNNING) in the UPDATE is untouched, so concurrent callbacks still win the race. The behavioral semantics are identical. - TOCTOU is handled. Between the ID read and the per-run re-fetch, a callback may have already changed a run's status. The re-fetch will load the run,
_verify_and_mark_runwill callbackend.verify_run(wasteful but benign), and the UPDATE will seerowcount == 0and returnFalse. This is the same behaviour as before the fix. - Error handling is sound. If
session.commit()throws, the exception is caught, the session auto-rolls back, and the loop continues — leaving the run in RUNNING state for the next watchdog cycle. Correct. - Comments add real value. The docstring paragraph and the inline comment explaining WHY the two-phase session split exists are precisely the kind of non-obvious invariants that belong in source code. No comment noise.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
Single-file change to a periodic background worker. The fix reduces lock contention without changing any business logic, public API surface, or data model. Worst case on a regression is the watchdog processes stale runs slightly less efficiently — there is no data-loss or API-breaking risk.
VERDICT:
✅ Worth merging — Correct, minimal, well-explained. Solves a confirmed production problem.
KEY INSIGHT:
The two-phase read (IDs-only session closes first, then one session per run) is the textbook cure for long-held batch locks; the implementation here is clean and idiomatic.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Problem
Two slow queries were observed in production on the
automation_runstable:UPDATE … SET status, error_detail, completed_at WHERE id=? AND status=?UPDATE … SET status, completed_at WHERE id=? AND status=?Both filter by primary key (
id), so the slowness is not an indexing problem — a PK lookup is nanoseconds. Multi-second execution on a single-row UPDATE is always lock contention.Root cause
mark_stale_runsinwatchdog.pyopened one database session for the entire stale-run batch:Each
UPDATEinside_verify_and_mark_runacquired a row-level exclusive lock that was held until the singlesession.commit()at the very end — across every subsequentbackend.verify_run()call. When 10 stale runs were in a batch and each verification took ~2s, the first locked row was held for ~18s after its own UPDATE ran.Callback endpoint UPDATEs (
WHERE id=? AND status=RUNNING) on those same rows queued behind the watchdog's open transaction, producing the 6–19s waits. The < 1 avg rows returned confirms the optimistic lock was frequently lost: by the time the waiting UPDATE ran, the watchdog had already changed the status.Fix
Fetch only the stale run IDs in a short-lived read session that closes immediately, then process each run in its own dedicated session, committing right after
_verify_and_mark_run. Row locks are released as soon as each run is processed rather than held for the duration of the batch.Testing
All 7 existing watchdog unit tests pass. The optimistic locking behaviour (
AND status = RUNNING) is fully preserved.This pull request was created by an AI agent (OpenHands) on behalf of the user.