
perf: SQLite optimization for large databases (1M+ torrents) #4

Merged: gmonarque merged 5 commits into gmonarque:main from gschaer:perf/sqlite-optimization on Mar 27, 2026

Conversation

@gschaer (Contributor) commented Mar 25, 2026

Summary

Performance improvements and bug fixes for Lighthouse instances with large databases (1M+ torrents).

Performance: SQLite indexes and query optimizations

  • Add 7 composite indexes for hot query paths (sort order, uploader lookup, category+sort, imdb/tmdb, activity log, comments)
  • PRAGMA tuning: 64MB cache, 256MB mmap, temp tables in memory
  • Rewrite search/stats queries from JOIN + DISTINCT to EXISTS subquery, eliminating temp B-trees
  • Drive aggregate stats from torrent_uploads (~10x smaller than torrents) instead of full table scans
  • Use INDEXED BY to force optimal index selection for category and sort queries
  • Cache stats response for 60 seconds
  • Lighten indexer status endpoint (two simple COUNTs instead of full stats)
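The JOIN + DISTINCT → EXISTS rewrite is the core of the search speedup. Lighthouse is Go, but the SQL-level behavior can be sketched with Python's stdlib sqlite3; the schema below is a simplified assumption, not the project's real one. Both query shapes return the same rows, but the EXISTS form never materializes duplicate join rows, so SQLite needs no temp B-tree to dedup them:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE torrents (id INTEGER PRIMARY KEY, trust_score INTEGER, first_seen_at INTEGER);
CREATE TABLE torrent_uploads (torrent_id INTEGER, uploader_npub TEXT);
CREATE INDEX idx_uploads ON torrent_uploads(uploader_npub, torrent_id);
INSERT INTO torrents VALUES (1, 5, 100), (2, 3, 200), (3, 4, 300);
-- torrent 1 was uploaded twice: the join produces a duplicate row for it
INSERT INTO torrent_uploads VALUES (1, 'npub1'), (1, 'npub2'), (2, 'npub1');
""")

# Original shape: JOIN + DISTINCT (needs a dedup step)
join_q = """SELECT DISTINCT t.id, t.first_seen_at FROM torrents t
            JOIN torrent_uploads tu ON tu.torrent_id = t.id
            ORDER BY t.first_seen_at DESC"""

# Rewritten shape: correlated EXISTS, one output row per torrent by construction
exists_q = """SELECT t.id, t.first_seen_at FROM torrents t
              WHERE EXISTS (SELECT 1 FROM torrent_uploads tu
                            WHERE tu.torrent_id = t.id)
              ORDER BY t.first_seen_at DESC"""

assert con.execute(join_q).fetchall() == con.execute(exists_q).fetchall()
print(con.execute(exists_q).fetchall())  # [(2, 200), (1, 100)]
```

Running EXPLAIN QUERY PLAN on the two forms shows the JOIN + DISTINCT variant using a temporary B-tree for the DISTINCT step, which the EXISTS variant avoids.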

Performance: Indexer runtime

  • Cache trusted uploaders and blacklist in memory (60s TTL) instead of querying the DB on every incoming event
  • Resume historical fetch from last indexed event timestamp, avoiding re-processing all events from all relays on every startup
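The trusted-uploader/blacklist cache uses double-checked locking (per the commit message below). A minimal Python sketch of that pattern — names like `load_trusted` are hypothetical, and the real Go code would use a `sync.RWMutex` rather than relying on Python's GIL for the unlocked first read:

```python
import threading, time

class TTLSetCache:
    """Caches a set (e.g. trusted uploader npubs) for ttl seconds,
    refreshing via double-checked locking so only one thread reloads."""
    def __init__(self, load, ttl=60.0):
        self._load = load          # callable that queries the DB
        self._ttl = ttl
        self._lock = threading.Lock()
        self._value = None
        self._expires = 0.0

    def get(self):
        if time.monotonic() < self._expires:      # first check, no lock
            return self._value
        with self._lock:
            if time.monotonic() < self._expires:  # second check, under lock
                return self._value
            self._value = self._load()            # only one thread hits the DB
            self._expires = time.monotonic() + self._ttl
            return self._value

calls = 0
def load_trusted():   # stands in for the real DB query
    global calls
    calls += 1
    return {"npub1", "npub2"}

cache = TTLSetCache(load_trusted, ttl=60)
cache.get()
cache.get()
print(calls)  # 1 -- the second call is served from cache
```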

Bug fixes

  • Config written to ./config/config.yaml (persisted volume) instead of ./config.yaml (ephemeral container root) — API key no longer regenerates on restart
  • Torznab handler accepts multi-user API keys with torznab/admin permission (previously only legacy config key worked)
  • IP rate limit increased from 100 to 300 req/min (dashboard UI makes ~10 API calls per page)

Test plan

  • Build: go build ./cmd/lighthouse/
  • Run tests: go test ./...
  • Verify new indexes are created on startup (check SELECT name FROM sqlite_master WHERE type='index')
  • Verify search, stats, and category filtering respond quickly
  • Verify API key persists across container restarts
  • Verify Torznab works with both legacy and multi-user API keys
  • Verify historical fetch resumes from last event on restart (check logs for since_unix)

🤖 Generated with Claude Code

jblemee and others added 4 commits March 25, 2026 18:05
…nt limit

FetchAllHistoricalTorrents loops through all pages using a decreasing
Until filter until the relay returns an empty batch, signaling end of
history. The relay's internal limit per query is 100 events.

Called in a goroutine at indexer startup alongside the real-time
subscription — duplicates are handled by the existing deduplicator.
Database-level performance improvements for large databases (1M+ torrents):

Schema:
- Add composite indexes: (trust_score, first_seen_at), (uploader_npub,
  torrent_id), (category, trust_score, first_seen_at), imdb_id, tmdb_id,
  (event_type, created_at), (infohash, created_at)

Connection tuning:
- PRAGMA cache_size = 64MB (up from 2MB default)
- PRAGMA mmap_size = 256MB (memory-mapped I/O)
- PRAGMA temp_store = MEMORY

Query optimizations:
- Search: use EXISTS subquery instead of JOIN + DISTINCT, eliminating
  temp B-trees for dedup and sorting. Force sort index and composite
  category+sort index via INDEXED BY.
- Stats: drive aggregate queries from torrent_uploads (~10x smaller
  than torrents table) instead of full table scans. Merge count + sum
  into single query. Cache response for 60 seconds.
- Indexer status: replace heavy GetStats() with two simple COUNTs
- Search counts: use category index directly for category-filtered
  counts, torrent_uploads for unfiltered counts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
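The PRAGMA values above are per-connection settings, so they must be applied each time a connection is opened. A sketch of the equivalent settings via Python's stdlib sqlite3 (note SQLite's convention that a negative cache_size is in KiB):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Negative cache_size means KiB: -65536 KiB = 64 MB of page cache
con.execute("PRAGMA cache_size = -65536")
con.execute("PRAGMA mmap_size = 268435456")   # 256 MB memory-mapped I/O
con.execute("PRAGMA temp_store = MEMORY")     # temp tables/indices in RAM

print(con.execute("PRAGMA cache_size").fetchone()[0])  # -65536
print(con.execute("PRAGMA temp_store").fetchone()[0])  # 2 == MEMORY
```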
Indexer runtime optimizations:

- Cache trusted uploaders and blacklist sets in memory with 60s TTL,
  eliminating repeated DB queries on every incoming event. Uses
  double-checked locking for thread safety.
- Resume historical fetch from the latest indexed event timestamp
  instead of re-fetching all history from every relay on each startup.
  Dramatically reduces startup time and duplicate event processing.
- FetchAllHistoricalTorrents now accepts a sinceTimestamp parameter
  to use as the Since filter in relay queries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Write config.yaml to ./config/ directory (mapped to /app/config
  volume) instead of ./ (ephemeral container root). Prevents the API
  key from being regenerated on every container restart.
- Torznab handler now accepts both legacy config API key and multi-user
  database API keys with torznab or admin permission. Previously only
  the legacy key was checked, causing admin-created keys to return
  Torznab error code 100.
- Increase IP rate limit from 100 to 300 req/min. The dashboard UI
  makes ~10 API calls per page navigation, so switching pages quickly
  exceeded the previous limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gmonarque gmonarque assigned gmonarque and unassigned gmonarque Mar 26, 2026
@gmonarque gmonarque self-requested a review March 26, 2026 18:12
@gmonarque gmonarque added the enhancement New feature or request label Mar 26, 2026
@gmonarque (Owner) left a comment:

nice enhancements, thanks :) some comments to discuss a bit but almost good to go

var totalSize sql.NullInt64
sizeQuery := `SELECT SUM(t.size) FROM torrents t
JOIN torrent_uploads tu ON t.id = tu.torrent_id
summaryQuery := `SELECT COUNT(*), COALESCE(SUM(t.size), 0)
@gmonarque (Owner)

I think this changes the meaning of the dashboard stats. torrent_uploads is one row per upload event, not one row per torrent, so COUNT(*) and SUM(t.size) here will overcount anything that was uploaded more than once

@gschaer (Contributor, Author) commented Mar 27, 2026

Good catch. Fixed — now uses COUNT(DISTINCT tu.torrent_id). Benchmarked at the same speed as COUNT(*) on 1M rows, so no performance regression.
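The overcounting and the fix can be demonstrated with a tiny stdlib-sqlite3 sketch (simplified, assumed schema): a torrent uploaded twice is counted twice by `COUNT(*)` over the join, while `COUNT(DISTINCT tu.torrent_id)` counts it once.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE torrents (id INTEGER PRIMARY KEY, size INTEGER);
CREATE TABLE torrent_uploads (torrent_id INTEGER);
INSERT INTO torrents VALUES (1, 100), (2, 50);
-- torrent 1 was uploaded twice
INSERT INTO torrent_uploads VALUES (1), (1), (2);
""")

naive = con.execute("""SELECT COUNT(*), SUM(t.size)
    FROM torrent_uploads tu JOIN torrents t ON t.id = tu.torrent_id""").fetchone()
fixed = con.execute("""SELECT COUNT(DISTINCT tu.torrent_id)
    FROM torrent_uploads tu""").fetchone()

print(naive)  # (3, 250) -- torrent 1 counted (and summed) twice
print(fixed)  # (2,)
```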


// GetLatestEventTimestamp returns the unix timestamp of the most recently uploaded event.
// Used to resume historical fetch from where we left off.
func GetLatestEventTimestamp() (int64, error) {
@gmonarque (Owner)

This resume cursor is based on uploaded_at, which is the local insert time. The relay since filter uses the Nostr event timestamp, so late-arriving older events can get skipped after a restart?

@gschaer (Contributor, Author)

You're right. Changed to MAX(first_seen_at) FROM torrents which is indexed (vs the unindexed uploaded_at which requires a full table scan). Added a 1-hour buffer subtraction to handle clock skew and late-arriving events — this means a small overlap is re-processed on restart, but the deduplicator handles that safely.
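The resulting cursor logic is small; a hedged sketch with an assumed schema and a hypothetical `resume_since` helper (the real Go function name may differ):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE torrents (id INTEGER PRIMARY KEY, first_seen_at INTEGER);
CREATE INDEX idx_first_seen ON torrents(first_seen_at);
INSERT INTO torrents VALUES (1, 1000), (2, 5000), (3, 9000);
""")

CLOCK_SKEW_BUFFER = 3600  # re-fetch a 1-hour overlap; the deduplicator discards repeats

def resume_since(con):
    # MAX over an indexed column is O(log n): SQLite reads one index entry
    latest = con.execute("SELECT MAX(first_seen_at) FROM torrents").fetchone()[0] or 0
    return max(0, latest - CLOCK_SKEW_BUFFER)

print(resume_since(con))  # 5400 == 9000 - 3600
```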

log.Info().Str("relay", url).Int("page", page).Int("batch", len(events)).Int("total", totalFetched).Msg("Historical page fetched")

// Advance Until to just before the oldest event in this batch
oldest := events[len(events)-1].CreatedAt - 1
@gmonarque (Owner)

I think this can drop events at page boundaries. If more than one page of events shares the same CreatedAt second, moving until to oldest - 1 skips the rest of that second entirely. (minor)

@gschaer (Contributor, Author)

Fixed. Now uses the exact oldest timestamp instead of oldest - 1, with loop detection: if the next page returns the same until value, we decrement to avoid infinite pagination. The deduplicator handles any duplicate events from the overlap.
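A simulation of that pagination scheme against a fake relay (pure illustration, not the project's code): a full page of events shares one `created_at` second, so advancing to `oldest - 1` would have been the risky move; advancing to the exact `oldest` re-queries the boundary second, and loop detection decrements only once no progress is possible.

```python
# Simulated relay: returns up to `limit` events with created_at <= until,
# newest first. Three events share created_at == 10 (a full page).
EVENTS = [10, 10, 10, 9, 9, 8]

def fetch(until, limit=3):
    return [t for t in EVENTS if t <= until][:limit]

def fetch_all():
    seen, until = [], 1 << 62
    while True:
        batch = fetch(until)
        if not batch:
            break
        seen.extend(batch)
        oldest = batch[-1]
        if oldest == until:
            # Loop detection: the whole page sits in one second and we have
            # already queried exactly this window; decrement to force progress.
            until = oldest - 1
        else:
            # Advance to the exact oldest timestamp (not oldest - 1) so events
            # remaining in that second aren't skipped; the duplicate re-fetches
            # this causes are handled by the deduplicator.
            until = oldest
    return seen

print(sorted(set(fetch_all()), reverse=True))  # [10, 9, 8] -- nothing dropped
```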

countQuery += " AND t.category = ?"
countArgs = append(countArgs, categoryNum)
}
} else if category != "" {
@gmonarque (Owner)

total doesn’t look aligned with results anymore. This path drops the trust filter for category-only counts, and the unfiltered branch below counts upload rows rather than distinct torrents, so pagination can report more hits than the query can actually return

@gschaer (Contributor, Author)

Fixed the unfiltered count — now uses COUNT(DISTINCT torrent_id) FROM torrent_uploads which is correct and performs the same as COUNT(*) in benchmarks.

For category-filtered counts, the exact trust-filtered count requires a full JOIN across both tables (O(n) scan), which is prohibitively slow on large databases. We keep the category-index-only count (O(log n)) as an approximation. In practice this is accurate because the indexer already filters by trusted authors at ingest time, so untrusted torrents rarely exist in the table. Added a comment explaining this trade-off.

As a future improvement, the UI could display approximate counts as "~172,000 results" or "1-50 of many" (similar to Gmail's search) instead of an exact total, which would make this trade-off transparent to the user.
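The category-index-only count can be sketched with stdlib sqlite3 (assumed schema; the index mirrors the composite `(category, trust_score, first_seen_at)` index added in this PR). `INDEXED BY` pins the plan to that index, and the count deliberately omits the trust filter, so it includes any untrusted rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE torrents (id INTEGER PRIMARY KEY, category INTEGER,
                       trust_score INTEGER, first_seen_at INTEGER);
CREATE INDEX idx_cat_sort ON torrents(category, trust_score, first_seen_at);
INSERT INTO torrents VALUES (1, 5, 1, 10), (2, 5, 0, 20), (3, 6, 1, 30);
""")

# Approximate count: covered entirely by the index, no table access,
# no trust-filter join. INDEXED BY forces this plan.
approx = con.execute(
    "SELECT COUNT(*) FROM torrents INDEXED BY idx_cat_sort WHERE category = ?",
    (5,),
).fetchone()[0]

print(approx)  # 2 -- includes the trust_score=0 row; the exact filtered count is 1
```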

- Stats: use COUNT(DISTINCT torrent_id) to avoid overcounting torrents
  with multiple uploads (same performance as COUNT(*) in benchmarks)
- Resume cursor: use MAX(first_seen_at) from torrents (indexed, 1ms)
  instead of MAX(uploaded_at) from torrent_uploads (unindexed, slow).
  Subtract 1-hour buffer to handle clock skew and late-arriving events.
- Pagination boundary: avoid skipping events sharing the same CreatedAt
  second by using exact timestamp instead of oldest-1, with loop
  detection to prevent infinite pages.
- Search count: use COUNT(DISTINCT torrent_id) for unfiltered counts.
  Category-filtered counts use category index directly (approximate
  but O(log n) vs O(n) for exact trust-filtered JOIN).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gschaer gschaer requested a review from gmonarque March 27, 2026 16:05
@gmonarque gmonarque merged commit 3397cd8 into gmonarque:main Mar 27, 2026
Labels: enhancement (New feature or request)
3 participants