gateway: faster mass reconnect (Phases 1 + 3) #69

Open
bensonc wants to merge 7 commits into master from feat/faster-mass-reconnect-phase1

Conversation


bensonc (Member) commented Apr 30, 2026

Summary

Implements Phases 1 and 3 of the faster mass-reconnect design. Phase 2 (drain-based stabilize signal) is deferred — the divergence + sweep work in Phase 3 attacks the same bottleneck more directly.

A mass reconnect of ~1024 shards drops from ~75 min today to Discord's identify-throughput floor (~5–10 min) at steady state, once the DB has been populated and most guilds skip backfill via the divergence check. Until that steady state is reached, the structural changes from Phase 1 still let ops raise identify-stabilize concurrency (4–8x) to land in the 8–16 min range; default config stays at ~64 min.

Phase 1 — lock-shrink + decoupled stabilize gate

  • WS dial + HELLO moved before identify mutex acquire.
  • Identify mutex held only for IDENTIFY_PACING_SECONDS (default 10s) + 5s grace, regardless of READY arrival. Slow READYs no longer wedge the bucket.
  • 60s stabilize wait moved to a per-process weighted semaphore. Capacity = IDENTIFY_STABILIZE_CONCURRENCY (default 1). Hold = IDENTIFY_STABILIZE_SECONDS (default 60). Skipped when shouldProcessMembers() is false.
  • Etcd lease TTL now reflects the new (~15s) hold time (pacing + grace). Crashed shards free the bucket faster.

Phase 3a — divergence check at GUILD_CREATE

  • handler.GuildCreate reads cached member count once via existing GetGuildMemberCount and returns it alongside Discord's member_count in EventPayload.
  • gatewayws.shouldRequestMembers skips Request Guild Members when |discord - cached| / discord <= BACKFILL_DIVERGENCE_RATIO (default 0.05). Cold guilds (cached == 0) always request.
  • Steady-state mass reconnect on a populated DB skips most member backfill → no chunk burst → stabilize semaphore drains immediately.

Phase 3b — bounded-rate cursor sweep

  • New state.DB.GetGuildIDsAfter(after, limit) paged guild enumeration. Postgres uses the existing PK index; FoundationDB uses a SelectorRange from FirstGreaterThan(guild_key(after)). Never loads more than limit ids into memory.
  • internal/manager/sweep.go runs one goroutine per process. Pages guilds, dispatches Request Guild Members to the owning shard for ids within this process's shard range, throttled by golang.org/x/time/rate.
  • Cursor persisted to etcd at /gateway/sweep_cursor/<name> — survives restarts, no schema migration.
  • Wraps to 0 when end of guild data reached, with a brief sleep to avoid spinning on an empty DB.

New env vars (all optional, defaults preserve current behavior)

| Var | Default | Purpose |
| --- | --- | --- |
| IDENTIFY_PACING_SECONDS | 10 | Identify mutex hold time after IDENTIFY |
| IDENTIFY_STABILIZE_CONCURRENCY | 1 | Concurrent shards in backfill window |
| IDENTIFY_STABILIZE_SECONDS | 60 | Per-shard stabilize hold |
| IDENTIFY_STABILIZE_SECONDS_MAX | 2x IDENTIFY_STABILIZE_SECONDS | Stabilize safety bound |
| BACKFILL_DIVERGENCE_RATIO | 0.05 | Skip request when divergence <= this |
| BACKFILL_SWEEP_ENABLED | true | Set false to disable the background sweep |
| BACKFILL_SWEEP_REQUESTS_PER_SECOND | 12 | Global sweep dispatch rate |
| BACKFILL_SWEEP_BATCH | 200 | Max ids loaded per sweep query |

Out of scope

  • Phase 2 (drain-based stabilize signal). Skipped because Phase 3 makes the fixed 60s wait the right knob to keep — most shards trigger zero or few backfills at steady state.
  • Auto-discovering Discord max_concurrency to expand from 16 buckets.
  • Replacing etcd identify mutex with in-process semaphore (separate cleanup).

Test plan

  • go build ./... clean
  • go test ./internal/gatewayws/... ./internal/manager/... ./discord/... ./handler/... clean
  • Deploy to non-prod cluster; confirm log lines: identify lock acquired wait=..., stabilize semaphore acquired/released held=..., sweep batch dispatched ...
  • Force a mass reconnect; confirm reconnect completes faster than today even with default config
  • Bump IDENTIFY_STABILIZE_CONCURRENCY=4, observe state.DB batcher queue depth — if stable, ratchet further
  • Confirm sweep wraps after a full cycle; validate cursor persists across restarts via etcd

🤖 Generated with Claude Code

bensonc and others added 2 commits April 30, 2026 09:59
Spec covering 3 phases: lock-shrink + decoupled stabilize gate,
drain-based stabilize signal, and conditional/swept member backfill.
Targets ~50min mass identify down to Discord-floor (~7.5min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of the faster mass-reconnect design.

- Move WS dial + HELLO outside the identify mutex; Discord pacing only
  applies to IDENTIFY itself, so the dial latency no longer serializes
  the bucket.
- Reduce identify-mutex hold time from ~70s to ~15s (pacing + grace).
  Lock is released right after pacing elapses regardless of READY
  arrival, so a slow READY no longer wedges the bucket.
- Move the 60s stabilize wait off the identify mutex onto a separate
  in-process semaphore (IDENTIFY_STABILIZE_CONCURRENCY, default 1).
  Per-shard hold time tunable via IDENTIFY_STABILIZE_SECONDS, capped
  by IDENTIFY_STABILIZE_SECONDS_MAX. Skipped for shards that won't
  request guild members.
- Identify pacing window tunable via IDENTIFY_PACING_SECONDS.
- Etcd lease TTL now reflects only the identify hold time, so a
  crashed shard frees the bucket faster.

Default config preserves current behavior (one stabilizing shard at a
time, 60s hold). The win comes from raising stabilize concurrency once
the state.DB batcher proves it can absorb more.

See docs/superpowers/specs/2026-04-30-faster-mass-reconnect-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist (bot) left a comment


Code Review

This pull request implements the first phase of the 'Faster Gateway Mass Reconnect' design, aimed at reducing shard re-identification time by decoupling Discord's identify pacing from database stabilization. Key changes include moving the WebSocket handshake outside the identify lock, releasing the lock after a fixed pacing window, and introducing a cross-shard semaphore to manage backfill concurrency. Review feedback identifies critical race conditions in background goroutines where session state and contexts are captured by pointer, which could lead to premature lock releases or failures during session restarts.

Comment thread internal/gatewayws/ws.go
Comment on lines +535 to +558
func (s *Session) scheduleIdentifyLockRelease() {
	pacing := s.identifyPacing
	if pacing <= 0 {
		pacing = IdentifyWaitTime
	}
	hold := pacing + readyGrace
	go func() {
		t := time.NewTimer(hold)
		defer t.Stop()
		select {
		case <-t.C:
		case <-s.ctx.Done():
		}
		if s.identifyMu == nil || s.identifyMu.Key() == "" {
			return
		}
		if s.identifyMu.IsOwner().Result != etcdserverpb.Compare_EQUAL {
			return
		}
		if err := s.releaseIdentifyLock(); err != nil {
			s.log.Error(s.ctx, "release identify lock after pacing", slog.Error(err))
		}
	}()
}

high

The background goroutine in scheduleIdentifyLockRelease captures the Session pointer (s), which leads to a race condition if the session restarts. When Open is called again (e.g., after a connection failure), s.ctx and s.identifyMu are overwritten. The goroutine from the previous attempt will then be waiting on the new session's context and potentially releasing the new session's identify lock prematurely.

Additionally, the comment mentions context.Background to survive unwinding, but the implementation uses s.ctx. If s.ctx is cancelled, s.releaseIdentifyLock() will fail because it also uses the cancelled s.ctx for the etcd Unlock call.

You should capture the context, mutex, and logger locally before starting the goroutine, and use a background context with a timeout for the actual unlock operation.

func (s *Session) scheduleIdentifyLockRelease() {
	pacing := s.identifyPacing
	if pacing <= 0 {
		pacing = IdentifyWaitTime
	}
	hold := pacing + readyGrace
	
	// Capture session state to avoid races if the session restarts.
	ctx, mu, log := s.ctx, s.identifyMu, s.log

	go func() {
		t := time.NewTimer(hold)
		defer t.Stop()
		select {
		case <-t.C:
		case <-ctx.Done():
		}
		if mu == nil || mu.Key() == "" {
			return
		}
		if mu.IsOwner().Result != etcdserverpb.Compare_EQUAL {
			return
		}

		// Use a background context for the unlock to ensure it completes
		// even if the session context has been cancelled.
		unlockCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()

		log.Info(ctx, "release identify lock after pacing", slog.F("key", mu.Key()))
		if err := mu.Unlock(unlockCtx); err != nil {
			log.Error(ctx, "release identify lock after pacing", slog.Error(err))
		}
	}()
}

Comment thread internal/gatewayws/ws.go
Comment on lines +564 to +600
func (s *Session) holdStabilize() {
	if s.stabilizeSem == nil || s.stabilizeDuration <= 0 {
		return
	}

	max := s.stabilizeMaxDuration
	if max <= 0 {
		max = 2 * s.stabilizeDuration
	}

	acquireStart := time.Now()
	select {
	case s.stabilizeSem <- struct{}{}:
	case <-s.ctx.Done():
		return
	}
	acquired := time.Now()
	s.log.Info(s.ctx, "stabilize semaphore acquired",
		slog.F("wait", acquired.Sub(acquireStart).String()))

	defer func() {
		select {
		case <-s.stabilizeSem:
		default:
		}
		s.log.Info(s.ctx, "stabilize semaphore released",
			slog.F("held", time.Since(acquired).String()))
	}()

	hold := min(s.stabilizeDuration, max)
	t := time.NewTimer(hold)
	defer t.Stop()
	select {
	case <-t.C:
	case <-s.ctx.Done():
	}
}

high

Similar to scheduleIdentifyLockRelease, holdStabilize captures the Session pointer and accesses s.ctx inside a background goroutine. If the session restarts and s.ctx is reassigned, the goroutine will wait on the wrong context.

Capture s.ctx locally at the start of the function to ensure the goroutine remains bound to the session attempt that started it.

func (s *Session) holdStabilize() {
	if s.stabilizeSem == nil || s.stabilizeDuration <= 0 {
		return
	}

	ctx, log := s.ctx, s.log
	max := s.stabilizeMaxDuration
	if max <= 0 {
		max = 2 * s.stabilizeDuration
	}

	acquireStart := time.Now()
	select {
	case s.stabilizeSem <- struct{}{}:
	case <-ctx.Done():
		return
	}
	acquired := time.Now()
	log.Info(ctx, "stabilize semaphore acquired",
		slog.F("wait", acquired.Sub(acquireStart).String()))

	defer func() {
		select {
		case <-s.stabilizeSem:
		default:
		}
		log.Info(ctx, "stabilize semaphore released",
			slog.F("held", time.Since(acquired).String()))
	}()

	hold := min(s.stabilizeDuration, max)
	t := time.NewTimer(hold)
	defer t.Stop()
	select {
	case <-t.C:
	case <-ctx.Done():
	}
}

bensonc and others added 3 commits April 30, 2026 15:45
Recompute the mass-identify math: 1024 / 16 buckets = 64 shards per
bucket. Today ~75 min worst case (was stated 50). Phase 1 default ~64
min; raising stabilize concurrency 4x → ~16 min, 8x → ~8 min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sweep

Phase 3 redesigned to avoid per-guild metadata at Tatsu's scale.

- Phase 3a: divergence check at GUILD_CREATE only (cached_count vs
  Discord member_count, default 5% threshold). No new schema.
- Phase 3b: bounded-rate cursor sweep that pages guilds via PK range
  scan, never loading the full guild set into memory. One cursor row
  per process, no per-guild timestamps, no new indexes.
- Drift bound is now a function of sweep rate (default 12 req/s →
  ~24h full sweep over 1M guilds), not per-row freshness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3a (divergence at GUILD_CREATE):
- handler.GuildCreate now reads cached member count once and returns it
  alongside Discord's reported count via EventPayload.
- gatewayws.shouldRequestMembers compares the two; skips Request Guild
  Members when divergence is below BACKFILL_DIVERGENCE_RATIO (default
  0.05). Cold guilds (cached_count == 0) always request.
- Replaces the unconditional per-GUILD_CREATE backfill that drove the
  60s stabilize wait in mass-reconnect.

Phase 3b (bounded-rate cursor sweep):
- New state.DB method GetGuildIDsAfter(after, limit) for paged guild
  enumeration; uses the guilds PK index in psql and a SelectorRange
  in fdb. Never loads more than `limit` ids into memory.
- internal/manager/sweep.go runs one goroutine per process. Pages
  guilds, dispatches Request Guild Members to the owning shard at
  BACKFILL_SWEEP_REQUESTS_PER_SECOND (default 12). Skips guilds owned
  by other processes' shard ranges.
- Cursor persisted to etcd at /gateway/sweep_cursor/<name>; survives
  process restarts. No schema migration required.
- Disable via BACKFILL_SWEEP_ENABLED=false. Tunables:
  BACKFILL_SWEEP_BATCH (default 200), BACKFILL_SWEEP_REQUESTS_PER_SECOND.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bensonc changed the title from "gateway: shrink identify lock + decouple stabilize gate (Phase 1)" to "gateway: faster mass reconnect (Phases 1 + 3)" on Apr 30, 2026
bensonc and others added 2 commits April 30, 2026 16:52
Adds state.DB.CountUnstableShards(name, start, stop) — counts shards
whose persisted curState is not in the steady-state inner loop
('read message' / 'push event to redis'). Same query Tatsu already
uses ad-hoc to spot stuck shards.

Manager.logHealth now logs this count every 5 min as 'shard stability
unstable=N total=M'. With the Phase 1 + 3 changes deployed, the
count should drop to ~0 much faster after a mass reconnect — the
clearest signal that the lock-shrink is doing what it claims.

FoundationDB backend stub-panics like other shards methods; the
manager recovers and skips the log on FDB-backed deploys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-in

Per-guild COUNT(*) on members costs seconds for medium-to-large
guilds in production, even with the index-only scan plan, due to
visibility-map staleness on the hot members table. At 1M guilds in a
~10min mass reconnect that's ~1.7K queries/sec — Postgres falls over.

Move the cached-count fetch out of handler.GuildCreate's hot path
(restored to the original "only when Discord didn't supply a count"
behavior) and into a lazy fetch in gatewayws.shouldRequestMembersForGuild
that only runs when the divergence check is enabled.

Default BACKFILL_DIVERGENCE_RATIO to 0 (disabled). Ops opt in once a
maintained guilds.member_count column or equivalent makes the read O(1).
With divergence disabled, behavior matches pre-Phase-3: every
GUILD_CREATE triggers Request Guild Members. Phase 1 lock-shrink and
Phase 3b sweep are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>