perf: SQLite optimization for large databases (1M+ torrents) #4

gmonarque merged 5 commits into gmonarque:main
Conversation
…nt limit

FetchAllHistoricalTorrents loops through all pages using a decreasing Until filter until the relay returns an empty batch, signaling the end of history. The relay's internal limit per query is 100 events. It is called in a goroutine at indexer startup alongside the real-time subscription; duplicates are handled by the existing deduplicator.
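A minimal sketch of that pagination loop (the `Event` type, `fetchPage` signature, and `fetchAllHistorical` name are illustrative stand-ins, not the project's actual code):

```go
package main

import "fmt"

// Event stands in for a Nostr event; CreatedAt is a unix timestamp.
type Event struct{ CreatedAt int64 }

// fetchPage stands in for one relay query capped at 100 events,
// returning events with CreatedAt <= until, newest first.
type fetchPage func(until int64) []Event

// fetchAllHistorical pages backwards through history: each batch's
// oldest timestamp (minus one) becomes the next Until, and an empty
// batch signals the end of history.
func fetchAllHistorical(fetch fetchPage, start int64) []Event {
	var all []Event
	until := start
	for {
		batch := fetch(until)
		if len(batch) == 0 {
			return all // relay has no older events
		}
		all = append(all, batch...)
		// advance Until to just before the oldest event in this batch
		until = batch[len(batch)-1].CreatedAt - 1
	}
}

func main() {
	// fake relay holding 250 events at timestamps 1..250
	relay := func(until int64) []Event {
		var out []Event
		for t := until; t >= 1 && len(out) < 100; t-- {
			if t <= 250 {
				out = append(out, Event{CreatedAt: t})
			}
		}
		return out
	}
	events := fetchAllHistorical(relay, 1000)
	fmt.Println(len(events)) // all 250 events across three pages
}
```

Note this is the pre-review version of the cursor advance (`oldest - 1`); the review below points out a boundary issue with it.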
Database-level performance improvements for large databases (1M+ torrents):

Schema:
- Add composite indexes: (trust_score, first_seen_at), (uploader_npub, torrent_id), (category, trust_score, first_seen_at), imdb_id, tmdb_id, (event_type, created_at), (infohash, created_at)

Connection tuning:
- PRAGMA cache_size = 64MB (up from the 2MB default)
- PRAGMA mmap_size = 256MB (memory-mapped I/O)
- PRAGMA temp_store = MEMORY

Query optimizations:
- Search: use an EXISTS subquery instead of JOIN + DISTINCT, eliminating temp B-trees for dedup and sorting. Force the sort index and the composite category+sort index via INDEXED BY.
- Stats: drive aggregate queries from torrent_uploads (~10x smaller than the torrents table) instead of full table scans. Merge count + sum into a single query. Cache the response for 60 seconds.
- Indexer status: replace the heavy GetStats() with two simple COUNTs.
- Search counts: use the category index directly for category-filtered counts, and torrent_uploads for unfiltered counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
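The connection tuning amounts to running a handful of PRAGMA statements when each connection opens. A sketch of just the statement list (the `tuningPragmas` helper is my name; executing them against a driver is omitted):

```go
package main

import "fmt"

// tuningPragmas returns the PRAGMA statements described in the commit.
// cache_size uses SQLite's convention that a negative value is in KiB:
// -65536 KiB = 64 MiB of page cache.
func tuningPragmas() []string {
	return []string{
		"PRAGMA cache_size = -65536",   // 64MB page cache (default ~2MB)
		"PRAGMA mmap_size = 268435456", // 256MB memory-mapped I/O
		"PRAGMA temp_store = MEMORY",   // temp tables and indexes in RAM
	}
}

func main() {
	// In real code these would be executed on each new connection,
	// e.g. db.Exec(p) for each p with your SQLite driver of choice.
	for _, p := range tuningPragmas() {
		fmt.Println(p)
	}
}
```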
Indexer runtime optimizations:

- Cache trusted-uploader and blacklist sets in memory with a 60s TTL, eliminating repeated DB queries on every incoming event. Uses double-checked locking for thread safety.
- Resume the historical fetch from the latest indexed event timestamp instead of re-fetching all history from every relay on each startup. Dramatically reduces startup time and duplicate event processing.
- FetchAllHistoricalTorrents now accepts a sinceTimestamp parameter to use as the Since filter in relay queries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
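The double-checked-locking TTL cache can be sketched as follows (the `trustedCache` type and its `load` hook are illustrative, not the project's actual code; `load` stands in for the DB query):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// trustedCache caches the trusted-uploader set with a TTL, refreshing
// via a loader function. Double-checked locking: re-check expiry after
// taking the write lock so only one goroutine reloads per interval.
type trustedCache struct {
	mu        sync.RWMutex
	set       map[string]bool
	expiresAt time.Time
	ttl       time.Duration
	load      func() map[string]bool // stands in for the DB query
}

func (c *trustedCache) IsTrusted(npub string) bool {
	// fast path: read lock while the cached set is still fresh
	c.mu.RLock()
	if time.Now().Before(c.expiresAt) {
		ok := c.set[npub]
		c.mu.RUnlock()
		return ok
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	// second check: another goroutine may have refreshed already
	if time.Now().After(c.expiresAt) {
		c.set = c.load()
		c.expiresAt = time.Now().Add(c.ttl)
	}
	return c.set[npub]
}

func main() {
	loads := 0
	c := &trustedCache{
		ttl: 60 * time.Second,
		load: func() map[string]bool {
			loads++
			return map[string]bool{"npub1abc": true}
		},
	}
	for i := 0; i < 1000; i++ {
		c.IsTrusted("npub1abc")
	}
	fmt.Println(loads) // a single DB load serves the whole TTL window
}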
- Write config.yaml to the ./config/ directory (mapped to the /app/config volume) instead of ./ (the ephemeral container root). Prevents the API key from being regenerated on every container restart.
- The Torznab handler now accepts both the legacy config API key and multi-user database API keys with torznab or admin permission. Previously only the legacy key was checked, causing admin-created keys to return Torznab error code 100.
- Increase the per-IP rate limit from 100 to 300 req/min. The dashboard UI makes ~10 API calls per page navigation, so switching pages quickly exceeded the previous limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gmonarque
left a comment
nice enhancements, thanks :) some comments to discuss a bit but almost good to go
internal/api/handlers/stats.go (outdated)

- var totalSize sql.NullInt64
- sizeQuery := `SELECT SUM(t.size) FROM torrents t
-     JOIN torrent_uploads tu ON t.id = tu.torrent_id
+ summaryQuery := `SELECT COUNT(*), COALESCE(SUM(t.size), 0)
I think this changes the meaning of the dashboard stats. torrent_uploads is one row per upload event, not one row per torrent, so COUNT(*) and SUM(t.size) here will overcount anything that was uploaded more than once
Good catch. Fixed — now uses COUNT(DISTINCT tu.torrent_id). Benchmarked at the same speed as COUNT(*) on 1M rows, so no performance regression.
+ // GetLatestEventTimestamp returns the unix timestamp of the most recently uploaded event.
+ // Used to resume historical fetch from where we left off.
+ func GetLatestEventTimestamp() (int64, error) {
This resume cursor is based on uploaded_at, which is the local insert time. The relay since filter uses the Nostr event timestamp, so late-arriving older events can get skipped after a restart?
You're right. Changed to MAX(first_seen_at) FROM torrents which is indexed (vs the unindexed uploaded_at which requires a full table scan). Added a 1-hour buffer subtraction to handle clock skew and late-arriving events — this means a small overlap is re-processed on restart, but the deduplicator handles that safely.
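The resulting resume logic can be sketched as a small helper (the `resumeSince` name is mine; it would be fed the result of `MAX(first_seen_at)` from the torrents table):

```go
package main

import "fmt"

// resumeSince converts the latest indexed first_seen_at into the Since
// value for relay queries: back off one hour to tolerate clock skew and
// late-arriving events (the deduplicator absorbs the re-fetched overlap).
// A zero latest timestamp means the database is empty, so fetch full history.
func resumeSince(latestFirstSeen int64) int64 {
	if latestFirstSeen == 0 {
		return 0 // nothing indexed yet: full historical fetch
	}
	const buffer = 3600 // one hour, in seconds
	since := latestFirstSeen - buffer
	if since < 0 {
		return 0
	}
	return since
}

func main() {
	fmt.Println(resumeSince(0))          // 0: full history
	fmt.Println(resumeSince(1700000000)) // 1699996400: one hour earlier
}
```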
internal/nostr/relay.go (outdated)

  log.Info().Str("relay", url).Int("page", page).Int("batch", len(events)).Int("total", totalFetched).Msg("Historical page fetched")

  // Advance Until to just before the oldest event in this batch
  oldest := events[len(events)-1].CreatedAt - 1
I think this can drop events at page boundaries. If more than one page of events shares the same CreatedAt second, moving until to oldest - 1 skips the rest of that second entirely. (minor)
Fixed. Now uses the exact oldest timestamp instead of oldest - 1, with loop detection: if the next page returns the same until value, we decrement to avoid infinite pagination. The deduplicator handles any duplicate events from the overlap.
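The boundary fix in that reply boils down to a tiny cursor helper (a sketch; `nextUntil` is my name, not the project's):

```go
package main

import "fmt"

// nextUntil advances the pagination cursor to the exact oldest timestamp
// in the batch, so events sharing that second are not skipped at page
// boundaries. If the cursor would not move (a full page of identical
// timestamps), decrement by one to guarantee progress; the deduplicator
// handles the re-fetched overlap.
func nextUntil(prevUntil, oldestInBatch int64) int64 {
	if oldestInBatch >= prevUntil {
		return prevUntil - 1 // cursor stuck: force progress
	}
	return oldestInBatch
}

func main() {
	fmt.Println(nextUntil(1700000100, 1700000050)) // normal page: 1700000050
	fmt.Println(nextUntil(1700000050, 1700000050)) // stuck page: 1700000049
}
```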
      countQuery += " AND t.category = ?"
      countArgs = append(countArgs, categoryNum)
    }
  } else if category != "" {
total doesn’t look aligned with results anymore. This path drops the trust filter for category-only counts, and the unfiltered branch below counts upload rows rather than distinct torrents, so pagination can report more hits than the query can actually return
Fixed the unfiltered count — now uses COUNT(DISTINCT torrent_id) FROM torrent_uploads which is correct and performs the same as COUNT(*) in benchmarks.
For category-filtered counts, the exact trust-filtered count requires a full JOIN across both tables (O(n) scan), which is prohibitively slow on large databases. We keep the category-index-only count (O(log n)) as an approximation. In practice this is accurate because the indexer already filters by trusted authors at ingest time, so untrusted torrents rarely exist in the table. Added a comment explaining this trade-off.
As a future improvement, the UI could display approximate counts as "~172,000 results" or "1-50 of many" (similar to Gmail's search) instead of an exact total, which would make this trade-off transparent to the user.
- Stats: use COUNT(DISTINCT torrent_id) to avoid overcounting torrents with multiple uploads (same performance as COUNT(*) in benchmarks)
- Resume cursor: use MAX(first_seen_at) from torrents (indexed, 1ms) instead of MAX(uploaded_at) from torrent_uploads (unindexed, slow). Subtract a 1-hour buffer to handle clock skew and late-arriving events.
- Pagination boundary: avoid skipping events that share the same CreatedAt second by using the exact timestamp instead of oldest-1, with loop detection to prevent infinite pages.
- Search count: use COUNT(DISTINCT torrent_id) for unfiltered counts. Category-filtered counts use the category index directly (approximate, but O(log n) vs O(n) for an exact trust-filtered JOIN).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Performance and bug fix improvements for Lighthouse instances with large databases (1M+ torrents).
Performance: SQLite indexes and query optimizations
- JOIN + DISTINCT to EXISTS subquery, eliminating temp B-trees
- torrent_uploads (~10x smaller than torrents) instead of full table scans
- INDEXED BY to force optimal index selection for category and sort queries

Performance: Indexer runtime
Bug fixes
- ./config/config.yaml (persisted volume) instead of ./config.yaml (ephemeral container root) — API key no longer regenerates on restart

Test plan
- go build ./cmd/lighthouse/
- go test ./...
- SELECT name FROM sqlite_master WHERE type='index')
- since_unix)

🤖 Generated with Claude Code