fix(phlower): partition invocations by day, drop instead of delete #26
Merged
The hourly purge stalls flushes when the retention DELETE has millions of rows to scan. The buffer grows unbounded, RSS climbs, and the liveness probe kills the pod. The same death spiral fired again on May 2 even with the batched purge from #25; it just took longer to get there. Mega-flushes of 200K → 678K → 2.55M → 1.47M records, then SIGKILL.

Switch invocations + invocation_details to daily-partitioned tables (`invocations_YYYYMMDD`). Retention enforcement is now `DROP TABLE`, a metadata operation that takes milliseconds. The hourly purge no longer contends with the flush loop.

On first boot against a pre-partition database, the existing tables get renamed to `invocations_legacy` / `invocation_details_legacy`. Reads UNION across them; once their data ages past retention, they're dropped wholesale. The rename is metadata-only: instant even on the 26 GB prod-us DB.

Also cap the in-memory write-behind buffer (`SQLITE_PENDING_BUFFER_CAP`, default 200k records). If the flush loop ever falls behind, we drop the oldest records in chunks and log a warning. Bounded RAM matters more than capturing every record during overload; the alternative is OOM and losing all of them.

Recovery still chunks per-partition by 4-hour windows so a multi-GB legacy table doesn't hold the read snapshot for minutes (see the sketch below). healthz uses `MAX(rowid)` on the legacy table instead of `COUNT(*)`: accurate enough, and gone after retention anyway.

Refs: prod-us 1 restart on 2026-05-02 03:15 UTC; second RSS spike at 14:25 UTC self-recovered without restart.
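A minimal sketch of that windowed recovery read, assuming `finished_at` stores epoch seconds; the function name and column handling are illustrative, not the actual phlower internals:

```python
import sqlite3

WINDOW_SECONDS = 4 * 3600  # one read transaction per 4-hour slice

def replay_partition(conn: sqlite3.Connection, table: str,
                     start_ts: int, end_ts: int):
    """Yield one partition's rows in 4-hour windows so a multi-GB
    legacy table never holds the read snapshot open for minutes."""
    for lo in range(start_ts, end_ts, WINDOW_SECONDS):
        hi = min(lo + WINDOW_SECONDS, end_ts)
        # `table` is an internal, trusted name (invocations_YYYYMMDD or
        # invocations_legacy), so the f-string interpolation is safe here
        yield conn.execute(
            f"SELECT * FROM {table} "
            "WHERE finished_at >= ? AND finished_at < ?",
            (lo, hi),
        ).fetchall()
```

Each window is its own short-lived read, so the snapshot is released between chunks.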
Author
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6e2cc061e2
- Migration: re-check `invocation_details` existence after the column-split path runs. Pre-split databases (the very old schema with args/kwargs inline on `invocations`) used to get the split-produced details table orphaned, because the rename branch keyed off the flag captured before the split.
- Backpressure: guard the warning modulus when the buffer cap is small. Tiny caps made `cap // 10` evaluate to 0 and crashed the warning gate with ZeroDivisionError on the first overflow. Use `drop_n` (already clamped to >= 1) as the modulus; see the sketch after this list.
- Reads: dedupe by task_id across partitions in `list_by_task` and `search`. A task whose lifecycle straddles midnight UTC can have rows in two partitions (RETRY yesterday, SUCCESS today). Rows arrive newest-first, so first-occurrence-wins gives the latest state. Aggregate recovery still slightly inflates counts for these edge cases; the inflation is bounded by the tiny fraction of tasks crossing midnight, accepted as a follow-up.
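A hypothetical sketch of the backpressure fix in the second bullet; the buffer class and field names are made up for illustration, and it assumes drops happen one record at a time:

```python
from collections import deque
import logging

log = logging.getLogger(__name__)

class PendingBuffer:
    """Write-behind buffer with drop-oldest overflow (illustrative)."""

    def __init__(self, cap: int):
        self.cap = cap
        self.buf: deque = deque()
        self.dropped = 0

    def append(self, record) -> None:
        self.buf.append(record)
        while len(self.buf) > self.cap:
            self.buf.popleft()
            self.dropped += 1
            # drop_n is clamped to >= 1, so the modulus below can never
            # be zero: the ZeroDivisionError the review flags when
            # `cap // 10` is used with a tiny cap.
            drop_n = max(1, self.cap // 10)
            if self.dropped % drop_n == 0:
                log.warning(
                    "pending buffer over cap=%d; %d records dropped so far",
                    self.cap, self.dropped,
                )
```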
Why
The hourly purge runs `DELETE FROM invocations WHERE finished_at < ?` against a 26 GB table. The DELETE holds the SQLite write lock long enough that the flush loop can't drain its in-memory buffer. New events keep arriving, the buffer grows, each subsequent flush is bigger, takes longer, and falls further behind. RSS runs away and the liveness probe kills the pod.

PR #25 made this gentler (batched purge with yields between batches), but it didn't fix the root cause; it just moved the threshold. The same death spiral fired again on May 2: mega-flushes of 200K → 678K → 2.55M → 1.47M records, then SIGKILL at 03:15 UTC.
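For reference, a hypothetical reconstruction of the #25-style batched purge described above (not the actual code): even with yields, every batch re-takes the write lock and re-scans the index, which is why it only moved the threshold.

```python
import time

def purge_batched(conn, cutoff: float, batch: int = 5000) -> None:
    """Delete expired rows in bounded batches, yielding between them."""
    while True:
        cur = conn.execute(
            "DELETE FROM invocations WHERE rowid IN ("
            " SELECT rowid FROM invocations WHERE finished_at < ? LIMIT ?)",
            (cutoff, batch),
        )
        conn.commit()
        if cur.rowcount < batch:
            break
        time.sleep(0.05)  # the #25 mitigation: let the flush loop breathe
```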
What
Move invocations to daily-partitioned tables (`invocations_YYYYMMDD`, `invocation_details_YYYYMMDD`). Retention is now `DROP TABLE`, a metadata operation. The hourly purge stops contending with the flush loop entirely.

Pre-partition databases get migrated on first boot: `ALTER TABLE invocations RENAME TO invocations_legacy`. The rename is metadata-only, instant even on the 26 GB prod-us DB. Reads UNION across the legacy table until its data ages past retention, then it gets dropped wholesale.

Also caps the in-memory write-behind buffer (`SQLITE_PENDING_BUFFER_CAP`, default 200k records) with drop-oldest semantics. Bounded RAM matters more than capturing every record during overload; the alternative is OOM and losing all of them. A sketch of the partition naming and retention drop follows below.
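A minimal sketch of the partition naming and retention-by-drop, assuming UTC day boundaries and epoch-seconds timestamps; the function names are illustrative, and `invocation_details_YYYYMMDD` would follow the same pattern:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def partition_name(ts: float) -> str:
    """Map an epoch-seconds timestamp to its daily partition table."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"invocations_{day:%Y%m%d}"

def enforce_retention(conn: sqlite3.Connection, retention_days: int) -> None:
    """Drop whole expired partitions: a metadata operation, so the hourly
    purge never holds the write lock against the flush loop."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    cutoff_suffix = f"{cutoff:%Y%m%d}"
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name GLOB 'invocations_[0-9]*'"
    ).fetchall()
    for (name,) in tables:
        # YYYYMMDD suffixes compare correctly as strings
        if name.rsplit("_", 1)[-1] < cutoff_suffix:
            conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.commit()
```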
Notes

`refresh_cached_stats` uses `MAX(rowid)` on the legacy table instead of `COUNT(*)`: index-fast, and accurate enough for healthz during the transition window.
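A sketch of that healthz stat with a hypothetical helper name; `MAX(rowid)` overestimates if rows were ever deleted, which is the "accurate enough" tradeoff:

```python
def legacy_row_estimate(conn) -> int:
    # COUNT(*) walks the whole 26 GB table; MAX(rowid) reads a single
    # B-tree page. It can overcount after deletions, which is acceptable
    # for a health signal on a table that ages out anyway.
    (n,) = conn.execute(
        "SELECT COALESCE(MAX(rowid), 0) FROM invocations_legacy"
    ).fetchone()
    return n
```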
Test plan

- Fresh boot against a pre-partition DB: tables renamed to `_legacy`, reads find both, recovery works (1000 rows replayed)
- `dropped-invocations` counter stays at 0 in healthz logs