perf: SQLite optimization for large databases (1M+ torrents) #4

gmonarque merged 5 commits into gmonarque:main
Conversation
…nt limit

FetchAllHistoricalTorrents loops through all pages using a decreasing Until filter until the relay returns an empty batch, signaling the end of history. The relay's internal limit per query is 100 events. It is called in a goroutine at indexer startup alongside the real-time subscription; duplicates are handled by the existing deduplicator.
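A minimal sketch of that pagination loop (the `Event` type, `fetchPage` signature, and `fetchAllHistorical` name are illustrative stand-ins, not the project's actual code):

```go
package main

import "fmt"

// Event stands in for a Nostr event; CreatedAt is a unix timestamp.
type Event struct{ CreatedAt int64 }

// fetchPage stands in for one relay query capped at 100 events,
// returning events with CreatedAt <= until, newest first.
type fetchPage func(until int64) []Event

// fetchAllHistorical pages backwards through history: each batch's
// oldest timestamp (minus one) becomes the next Until, and an empty
// batch signals the end of history.
func fetchAllHistorical(fetch fetchPage, start int64) []Event {
	var all []Event
	until := start
	for {
		batch := fetch(until)
		if len(batch) == 0 {
			return all // relay has no older events
		}
		all = append(all, batch...)
		// advance Until to just before the oldest event in this batch
		until = batch[len(batch)-1].CreatedAt - 1
	}
}

func main() {
	// fake relay holding 250 events at timestamps 1..250
	relay := func(until int64) []Event {
		var out []Event
		for t := until; t >= 1 && len(out) < 100; t-- {
			if t <= 250 {
				out = append(out, Event{CreatedAt: t})
			}
		}
		return out
	}
	events := fetchAllHistorical(relay, 1000)
	fmt.Println(len(events)) // all 250 events across three pages
}
```

Note this is the pre-review version of the cursor advance (`oldest - 1`); the review below points out a boundary issue with it.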
Database-level performance improvements for large databases (1M+ torrents):

Schema:
- Add composite indexes: (trust_score, first_seen_at), (uploader_npub, torrent_id), (category, trust_score, first_seen_at), imdb_id, tmdb_id, (event_type, created_at), (infohash, created_at)

Connection tuning:
- PRAGMA cache_size = 64MB (up from the 2MB default)
- PRAGMA mmap_size = 256MB (memory-mapped I/O)
- PRAGMA temp_store = MEMORY

Query optimizations:
- Search: use an EXISTS subquery instead of JOIN + DISTINCT, eliminating temp B-trees for dedup and sorting. Force the sort index and the composite category+sort index via INDEXED BY.
- Stats: drive aggregate queries from torrent_uploads (~10x smaller than the torrents table) instead of full table scans. Merge count + sum into a single query. Cache the response for 60 seconds.
- Indexer status: replace the heavy GetStats() with two simple COUNTs.
- Search counts: use the category index directly for category-filtered counts, and torrent_uploads for unfiltered counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
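The connection tuning amounts to running a handful of PRAGMA statements when each connection opens. A sketch of just the statement list (the `tuningPragmas` helper is my name; executing them against a driver is omitted):

```go
package main

import "fmt"

// tuningPragmas returns the PRAGMA statements described in the commit.
// cache_size uses SQLite's convention that a negative value is in KiB:
// -65536 KiB = 64 MiB of page cache.
func tuningPragmas() []string {
	return []string{
		"PRAGMA cache_size = -65536",   // 64MB page cache (default ~2MB)
		"PRAGMA mmap_size = 268435456", // 256MB memory-mapped I/O
		"PRAGMA temp_store = MEMORY",   // temp tables and indexes in RAM
	}
}

func main() {
	// In real code these would be executed on each new connection,
	// e.g. db.Exec(p) for each p with your SQLite driver of choice.
	for _, p := range tuningPragmas() {
		fmt.Println(p)
	}
}
```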
Indexer runtime optimizations:

- Cache trusted-uploader and blacklist sets in memory with a 60s TTL, eliminating repeated DB queries on every incoming event. Uses double-checked locking for thread safety.
- Resume the historical fetch from the latest indexed event timestamp instead of re-fetching all history from every relay on each startup. Dramatically reduces startup time and duplicate event processing.
- FetchAllHistoricalTorrents now accepts a sinceTimestamp parameter to use as the Since filter in relay queries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
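The double-checked-locking TTL cache can be sketched as follows (the `trustedCache` type and its `load` hook are illustrative, not the project's actual code; `load` stands in for the DB query):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// trustedCache caches the trusted-uploader set with a TTL, refreshing
// via a loader function. Double-checked locking: re-check expiry after
// taking the write lock so only one goroutine reloads per interval.
type trustedCache struct {
	mu        sync.RWMutex
	set       map[string]bool
	expiresAt time.Time
	ttl       time.Duration
	load      func() map[string]bool // stands in for the DB query
}

func (c *trustedCache) IsTrusted(npub string) bool {
	// fast path: read lock while the cached set is still fresh
	c.mu.RLock()
	if time.Now().Before(c.expiresAt) {
		ok := c.set[npub]
		c.mu.RUnlock()
		return ok
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	// second check: another goroutine may have refreshed already
	if time.Now().After(c.expiresAt) {
		c.set = c.load()
		c.expiresAt = time.Now().Add(c.ttl)
	}
	return c.set[npub]
}

func main() {
	loads := 0
	c := &trustedCache{
		ttl: 60 * time.Second,
		load: func() map[string]bool {
			loads++
			return map[string]bool{"npub1abc": true}
		},
	}
	for i := 0; i < 1000; i++ {
		c.IsTrusted("npub1abc")
	}
	fmt.Println(loads) // a single DB load serves the whole TTL window
}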
- Write config.yaml to the ./config/ directory (mapped to the /app/config volume) instead of ./ (the ephemeral container root). Prevents the API key from being regenerated on every container restart.
- The Torznab handler now accepts both the legacy config API key and multi-user database API keys with torznab or admin permission. Previously only the legacy key was checked, causing admin-created keys to return Torznab error code 100.
- Increase the per-IP rate limit from 100 to 300 req/min. The dashboard UI makes ~10 API calls per page navigation, so switching pages quickly exceeded the previous limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gmonarque
left a comment
nice enhancements, thanks :) some comments to discuss a bit but almost good to go
internal/api/handlers/stats.go (outdated)

- var totalSize sql.NullInt64
- sizeQuery := `SELECT SUM(t.size) FROM torrents t
-     JOIN torrent_uploads tu ON t.id = tu.torrent_id
+ summaryQuery := `SELECT COUNT(*), COALESCE(SUM(t.size), 0)
I think this changes the meaning of the dashboard stats. torrent_uploads is one row per upload event, not one row per torrent, so COUNT(*) and SUM(t.size) here will overcount anything that was uploaded more than once
Good catch. Fixed — now uses COUNT(DISTINCT tu.torrent_id). Benchmarked at the same speed as COUNT(*) on 1M rows, so no performance regression.
+ // GetLatestEventTimestamp returns the unix timestamp of the most recently uploaded event.
+ // Used to resume historical fetch from where we left off.
+ func GetLatestEventTimestamp() (int64, error) {
This resume cursor is based on uploaded_at, which is the local insert time. The relay since filter uses the Nostr event timestamp, so late-arriving older events can get skipped after a restart?
You're right. Changed to MAX(first_seen_at) FROM torrents which is indexed (vs the unindexed uploaded_at which requires a full table scan). Added a 1-hour buffer subtraction to handle clock skew and late-arriving events — this means a small overlap is re-processed on restart, but the deduplicator handles that safely.
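The resulting resume logic can be sketched as a small helper (the `resumeSince` name is mine; it would be fed the result of `MAX(first_seen_at)` from the torrents table):

```go
package main

import "fmt"

// resumeSince converts the latest indexed first_seen_at into the Since
// value for relay queries: back off one hour to tolerate clock skew and
// late-arriving events (the deduplicator absorbs the re-fetched overlap).
// A zero latest timestamp means the database is empty, so fetch full history.
func resumeSince(latestFirstSeen int64) int64 {
	if latestFirstSeen == 0 {
		return 0 // nothing indexed yet: full historical fetch
	}
	const buffer = 3600 // one hour, in seconds
	since := latestFirstSeen - buffer
	if since < 0 {
		return 0
	}
	return since
}

func main() {
	fmt.Println(resumeSince(0))          // 0: full history
	fmt.Println(resumeSince(1700000000)) // 1699996400: one hour earlier
}
```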
internal/nostr/relay.go (outdated)

  log.Info().Str("relay", url).Int("page", page).Int("batch", len(events)).Int("total", totalFetched).Msg("Historical page fetched")

  // Advance Until to just before the oldest event in this batch
  oldest := events[len(events)-1].CreatedAt - 1
I think this can drop events at page boundaries. If more than one page of events shares the same CreatedAt second, moving until to oldest - 1 skips the rest of that second entirely. (minor)
Fixed. Now uses the exact oldest timestamp instead of oldest - 1, with loop detection: if the next page returns the same until value, we decrement to avoid infinite pagination. The deduplicator handles any duplicate events from the overlap.
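The boundary fix in that reply boils down to a tiny cursor helper (a sketch; `nextUntil` is my name, not the project's):

```go
package main

import "fmt"

// nextUntil advances the pagination cursor to the exact oldest timestamp
// in the batch, so events sharing that second are not skipped at page
// boundaries. If the cursor would not move (a full page of identical
// timestamps), decrement by one to guarantee progress; the deduplicator
// handles the re-fetched overlap.
func nextUntil(prevUntil, oldestInBatch int64) int64 {
	if oldestInBatch >= prevUntil {
		return prevUntil - 1 // cursor stuck: force progress
	}
	return oldestInBatch
}

func main() {
	fmt.Println(nextUntil(1700000100, 1700000050)) // normal page: 1700000050
	fmt.Println(nextUntil(1700000050, 1700000050)) // stuck page: 1700000049
}
```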
      countQuery += " AND t.category = ?"
      countArgs = append(countArgs, categoryNum)
    }
  } else if category != "" {
total doesn’t look aligned with results anymore. This path drops the trust filter for category-only counts, and the unfiltered branch below counts upload rows rather than distinct torrents, so pagination can report more hits than the query can actually return
Fixed the unfiltered count — now uses COUNT(DISTINCT torrent_id) FROM torrent_uploads which is correct and performs the same as COUNT(*) in benchmarks.
For category-filtered counts, the exact trust-filtered count requires a full JOIN across both tables (O(n) scan), which is prohibitively slow on large databases. We keep the category-index-only count (O(log n)) as an approximation. In practice this is accurate because the indexer already filters by trusted authors at ingest time, so untrusted torrents rarely exist in the table. Added a comment explaining this trade-off.
As a future improvement, the UI could display approximate counts as "~172,000 results" or "1-50 of many" (similar to Gmail's search) instead of an exact total, which would make this trade-off transparent to the user.
- Stats: use COUNT(DISTINCT torrent_id) to avoid overcounting torrents with multiple uploads (same performance as COUNT(*) in benchmarks)
- Resume cursor: use MAX(first_seen_at) from torrents (indexed, 1ms) instead of MAX(uploaded_at) from torrent_uploads (unindexed, slow). Subtract a 1-hour buffer to handle clock skew and late-arriving events.
- Pagination boundary: avoid skipping events that share the same CreatedAt second by using the exact timestamp instead of oldest-1, with loop detection to prevent infinite pages.
- Search count: use COUNT(DISTINCT torrent_id) for unfiltered counts. Category-filtered counts use the category index directly (approximate, but O(log n) vs O(n) for an exact trust-filtered JOIN).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Performance and bug fix improvements for Lighthouse instances with large databases (1M+ torrents).
Performance: SQLite indexes and query optimizations
- JOIN + DISTINCT to EXISTS subquery, eliminating temp B-trees
- torrent_uploads (~10x smaller than torrents) instead of full table scans
- INDEXED BY to force optimal index selection for category and sort queries

Performance: Indexer runtime
Bug fixes
- ./config/config.yaml (persisted volume) instead of ./config.yaml (ephemeral container root) — API key no longer regenerates on restart

Test plan
- go build ./cmd/lighthouse/
- go test ./...
- SELECT name FROM sqlite_master WHERE type='index')
- since_unix)

🤖 Generated with Claude Code