
fix(phlower): partition invocations by day, drop instead of delete#26

Merged
webjunkie merged 2 commits into main from fix/sqlite-daily-partitions on May 2, 2026

Conversation

@webjunkie
Contributor

Why

Hourly purge runs DELETE FROM invocations WHERE finished_at < ? against a 26 GB table. The DELETE holds the SQLite write lock long enough that the flush loop can't drain its in-memory buffer. New events keep arriving, buffer grows, each subsequent flush is bigger, takes longer, falls further behind. RSS runs away, liveness probe kills the pod.

PR #25 made this gentler (batched purge with yields between batches), but it didn't fix the root cause — it just moved the threshold. Same death spiral fired again on May 2: mega-flushes of 200K → 678K → 2.55M → 1.47M records, then SIGKILL at 03:15 UTC.

What

Move invocations to daily-partitioned tables (invocations_YYYYMMDD, invocation_details_YYYYMMDD). Retention is now DROP TABLE — a metadata operation. The hourly purge stops contending with the flush loop entirely.
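The retention change can be sketched roughly as follows. This is a minimal stand-alone version, not the PR's actual code: `partition_name` and `purge_expired` are hypothetical names, and the real store presumably tracks its partitions rather than scanning `sqlite_master`. The point is that retention becomes a table scan over names, never over rows:

```python
import datetime
import sqlite3


def partition_name(day: datetime.date) -> str:
    """Table name for a given UTC day, e.g. invocations_20260502."""
    return f"invocations_{day.strftime('%Y%m%d')}"


def purge_expired(conn: sqlite3.Connection, retention_days: int,
                  today: datetime.date) -> list[str]:
    """Drop whole daily partitions older than the retention window.

    DROP TABLE is a metadata operation, so this never scans rows and
    never holds the write lock long enough to stall the flush loop.
    """
    cutoff = today - datetime.timedelta(days=retention_days)
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name GLOB 'invocations_[0-9]*'"
    ).fetchall()
    dropped = []
    for (name,) in rows:
        day = datetime.datetime.strptime(name.rsplit("_", 1)[1], "%Y%m%d").date()
        if day < cutoff:
            conn.execute(f"DROP TABLE {name}")
            dropped.append(name)
    return dropped
```

The `[0-9]*` in the GLOB keeps `invocations_legacy` out of the date-parsing path; the legacy table is handled separately during migration.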

Pre-partition databases get migrated on first boot: ALTER TABLE invocations RENAME TO invocations_legacy. Rename is metadata-only, instant even on the 26 GB prod-us DB. Reads UNION across the legacy table until its data ages past retention, then it gets dropped wholesale.
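In outline, the migration and the legacy-aware read path look something like this (a sketch under assumptions: `migrate_pre_partition` and `union_read_sql` are illustrative names, and the real reads select explicit columns rather than `*`):

```python
import sqlite3

LEGACY = "invocations_legacy"


def migrate_pre_partition(conn: sqlite3.Connection) -> bool:
    """Rename the old monolithic table on first boot, if present.

    ALTER TABLE ... RENAME TO is metadata-only in SQLite, so it is
    effectively instant regardless of table size.
    """
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name='invocations'"
    ).fetchone()
    if row is None:
        return False  # fresh DB or already migrated
    conn.execute(f"ALTER TABLE invocations RENAME TO {LEGACY}")
    return True


def union_read_sql(partitions: list[str]) -> str:
    """SELECT spanning the live daily partitions plus the legacy table."""
    tables = list(partitions) + [LEGACY]
    return " UNION ALL ".join(f"SELECT * FROM {t}" for t in tables)
```

Once every row in `invocations_legacy` is older than retention, the whole table goes away via the same DROP path as any other partition.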

Also caps the in-memory write-behind buffer (SQLITE_PENDING_BUFFER_CAP, default 200k records) with drop-oldest semantics. Bounded RAM matters more than capturing every record during overload — the alternative is OOM and losing all of them.
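The drop-oldest semantics amount to something like the following minimal sketch (assumed shape only; `PendingBuffer` is a hypothetical name, and the real buffer holds record batches and logs a warning when it sheds):

```python
from collections import deque


class PendingBuffer:
    """Write-behind buffer with a hard cap and drop-oldest semantics.

    cap defaults to 200_000, matching the assumed
    SQLITE_PENDING_BUFFER_CAP default described above.
    """

    def __init__(self, cap: int = 200_000):
        self.cap = cap
        self.records = deque()
        self.dropped = 0  # surfaced via healthz in the real store

    def append(self, record) -> None:
        if len(self.records) >= self.cap:
            # Shed a chunk rather than one record at a time, so sustained
            # overload costs O(1) amortized work per incoming record.
            # max(1, ...) keeps the chunk size sane for tiny caps.
            drop_n = max(1, self.cap // 10)
            for _ in range(min(drop_n, len(self.records))):
                self.records.popleft()
                self.dropped += 1
        self.records.append(record)
```

Trading the oldest pending records for bounded memory is the deliberate choice here: under overload, shedding 10% of the buffer beats the pod being SIGKILLed and losing everything in it.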

Notes

  • refresh_cached_stats uses MAX(rowid) on the legacy table instead of COUNT(*) — index-fast, accurate enough for healthz during the transition window.
  • Recovery still chunks per-partition by 4-hour windows so a multi-GB legacy table doesn't hold the read snapshot for minutes.
  • Today's and tomorrow's partitions get pre-created in the purge loop so midnight-UTC rollover doesn't race with the first flush of the day.
  • Disk-pressure path simplified: emergency mode just drops more partitions until under the cap. No more fallback DELETE.

Test plan

  • Smoke test: write across two days, query, purge — verified partition rotation
  • Migration test: pre-partition DB → renamed to _legacy, reads find both, recovery works (1000 rows replayed)
  • Docker build + boot: clean fresh-DB and migrated DB scenarios
  • Deploy to prod-us, monitor RSS over the next purge cycle (~01:00 UTC May 3)
  • Verify dropped-invocations counter stays at 0 in healthz logs

Hourly purge stalls flushes when retention DELETE has millions of rows
to scan. Buffer grows unbounded, RSS climbs, liveness probe kills the
pod. Same death spiral fired again on May 2 even with the batched purge
from #25 — it just took longer to get there. Mega-flushes of 200K → 678K
→ 2.55M → 1.47M records, then SIGKILL.

Switch invocations + invocation_details to daily-partitioned tables
(invocations_YYYYMMDD). Retention enforcement is now DROP TABLE — a
metadata operation that takes milliseconds. The hourly purge no longer
contends with the flush loop.

On first boot against a pre-partition database, the existing tables get
renamed to invocations_legacy / invocation_details_legacy. Reads UNION
across them; once their data ages past retention, they're dropped
wholesale. Rename is metadata-only — instant even on the 26 GB prod-us
DB.

Also cap the in-memory write-behind buffer (SQLITE_PENDING_BUFFER_CAP,
default 200k records). If the flush loop ever falls behind, we drop
oldest in chunks and log a warning. Bounded RAM matters more than
capturing every record during overload — the alternative is OOM and
losing all of them.

Recovery still chunks per-partition by 4-hour windows so a multi-GB
legacy table doesn't hold the read snapshot for minutes. healthz uses
MAX(rowid) on the legacy table instead of COUNT(*) — accurate enough,
and gone after retention anyway.

Refs: prod-us 1 restart on 2026-05-02 03:15 UTC; second RSS spike at
14:25 UTC self-recovered without restart.
@webjunkie webjunkie marked this pull request as ready for review May 2, 2026 20:27
@webjunkie
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e2cc061e2

Comment thread: src/phlower/sqlite_store.py (outdated)
Comment thread: src/phlower/store.py (outdated)
Comment thread: src/phlower/sqlite_store.py
- Migration: re-check `invocation_details` existence after the
  column-split path runs. On pre-split databases (the very old schema
  with args/kwargs inline on `invocations`), the split-produced
  details table was left orphaned because the rename branch keyed off
  a flag captured before the split.

- Backpressure: guard the warning modulus when the buffer cap is small.
  Tiny caps made `cap // 10` evaluate to 0 and crashed the warning
  gate with ZeroDivisionError on the first overflow. Use `drop_n`
  (already clamped to >=1) as the modulus.

- Reads: dedupe by task_id across partitions in `list_by_task` and
  `search`. A task whose lifecycle straddles midnight UTC can have
  rows in two partitions (RETRY yesterday, SUCCESS today). Rows
  arrive newest-first, so first-occurrence-wins gives the latest
  state. Aggregate recovery still slightly inflates counts for these
  edge cases — bounded by the tiny fraction of tasks crossing
  midnight, accepted as a follow-up.
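The cross-partition dedupe in the last bullet can be sketched as follows (a minimal stand-alone version; the real code operates on the store's row objects, and `dedupe_by_task_id` is a hypothetical name):

```python
def dedupe_by_task_id(rows: list[dict]) -> list[dict]:
    """Keep one row per task_id across partitions.

    Rows arrive newest-first, so first-occurrence-wins reports the
    latest state for a task whose lifecycle straddles midnight UTC
    (e.g. RETRY in yesterday's partition, SUCCESS in today's).
    """
    seen = set()
    out = []
    for row in rows:
        if row["task_id"] not in seen:
            seen.add(row["task_id"])
            out.append(row)
    return out
```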
@webjunkie webjunkie merged commit 799144b into main May 2, 2026
10 checks passed
@webjunkie webjunkie deleted the fix/sqlite-daily-partitions branch May 2, 2026 20:43
