gateway: faster mass reconnect (Phases 1 + 3) #69

Open
bensonc wants to merge 7 commits into master from feat/faster-mass-reconnect-phase1

Conversation


bensonc (Member) commented Apr 30, 2026

Summary

Implements Phases 1 and 3 of the faster mass-reconnect design. Phase 2 (drain-based stabilize signal) is deferred — the divergence + sweep work in Phase 3 attacks the same bottleneck more directly.

A mass reconnect of ~1024 shards drops from ~75 min today to Discord's identify-throughput floor (~5–10 min) at steady state, once the DB has been populated and most guilds skip backfill via the divergence check. Until that steady state is reached, the structural changes from Phase 1 still let ops raise identify-stabilize concurrency (4–8x) to land in the 8–16 min range; default config stays at ~64 min.

Phase 1 — lock-shrink + decoupled stabilize gate

  • WS dial + HELLO moved before identify mutex acquire.
  • Identify mutex held only for IDENTIFY_PACING_SECONDS (default 10s) + 5s grace, regardless of READY arrival. Slow READYs no longer wedge the bucket.
  • 60s stabilize wait moved to a per-process weighted semaphore. Capacity = IDENTIFY_STABILIZE_CONCURRENCY (default 1). Hold = IDENTIFY_STABILIZE_SECONDS (default 60). Skipped when shouldProcessMembers() is false.
  • Etcd lease TTL now reflects the new (~15s) hold time (pacing + grace). Crashed shards free the bucket faster.

Phase 3a — divergence check at GUILD_CREATE

  • handler.GuildCreate reads cached member count once via existing GetGuildMemberCount and returns it alongside Discord's member_count in EventPayload.
  • gatewayws.shouldRequestMembers skips Request Guild Members when |discord - cached| / discord <= BACKFILL_DIVERGENCE_RATIO (default 0.05). Cold guilds (cached == 0) always request.
  • Steady-state mass reconnect on a populated DB skips most member backfill → no chunk burst → stabilize semaphore drains immediately.

Phase 3b — bounded-rate cursor sweep

  • New state.DB.GetGuildIDsAfter(after, limit) paged guild enumeration. Postgres uses the existing PK index; FoundationDB uses a SelectorRange from FirstGreaterThan(guild_key(after)). Never loads more than limit ids into memory.
  • internal/manager/sweep.go runs one goroutine per process. Pages guilds, dispatches Request Guild Members to the owning shard for ids within this process's shard range, throttled by golang.org/x/time/rate.
  • Cursor persisted to etcd at /gateway/sweep_cursor/<name> — survives restarts, no schema migration.
  • Wraps to 0 when end of guild data reached, with a brief sleep to avoid spinning on an empty DB.

New env vars (all optional, defaults preserve current behavior)

| Var | Default | Purpose |
| --- | --- | --- |
| IDENTIFY_PACING_SECONDS | 10 | Identify mutex hold time after IDENTIFY |
| IDENTIFY_STABILIZE_CONCURRENCY | 1 | Concurrent shards in backfill window |
| IDENTIFY_STABILIZE_SECONDS | 60 | Per-shard stabilize hold |
| IDENTIFY_STABILIZE_SECONDS_MAX | 2x IDENTIFY_STABILIZE_SECONDS | Stabilize safety bound |
| BACKFILL_DIVERGENCE_RATIO | 0.05 | Skip request when divergence <= this |
| BACKFILL_SWEEP_ENABLED | true | Set false to disable the background sweep |
| BACKFILL_SWEEP_REQUESTS_PER_SECOND | 12 | Global sweep dispatch rate |
| BACKFILL_SWEEP_BATCH | 200 | Max ids loaded per sweep query |

Out of scope

  • Phase 2 (drain-based stabilize signal). Skipped because Phase 3 makes the fixed 60s wait the right knob to keep — most shards trigger zero or few backfills at steady state.
  • Auto-discovering Discord max_concurrency to expand from 16 buckets.
  • Replacing etcd identify mutex with in-process semaphore (separate cleanup).

Test plan

  • go build ./... clean
  • go test ./internal/gatewayws/... ./internal/manager/... ./discord/... ./handler/... clean
  • Deploy to non-prod cluster; confirm log lines: identify lock acquired wait=..., stabilize semaphore acquired/released held=..., sweep batch dispatched ...
  • Force a mass reconnect; confirm reconnect completes faster than today even with default config
  • Bump IDENTIFY_STABILIZE_CONCURRENCY=4, observe state.DB batcher queue depth — if stable, ratchet further
  • Confirm sweep wraps after a full cycle; validate cursor persists across restarts via etcd

🤖 Generated with Claude Code

bensonc and others added 2 commits April 30, 2026 09:59
Spec covering 3 phases: lock-shrink + decoupled stabilize gate,
drain-based stabilize signal, and conditional/swept member backfill.
Targets ~50min mass identify down to Discord-floor (~7.5min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of the faster mass-reconnect design.

- Move WS dial + HELLO outside the identify mutex; Discord pacing only
  applies to IDENTIFY itself, so the dial latency no longer serializes
  the bucket.
- Reduce identify-mutex hold time from ~70s to ~15s (pacing + grace).
  Lock is released right after pacing elapses regardless of READY
  arrival, so a slow READY no longer wedges the bucket.
- Move the 60s stabilize wait off the identify mutex onto a separate
  in-process semaphore (IDENTIFY_STABILIZE_CONCURRENCY, default 1).
  Per-shard hold time tunable via IDENTIFY_STABILIZE_SECONDS, capped
  by IDENTIFY_STABILIZE_SECONDS_MAX. Skipped for shards that won't
  request guild members.
- Identify pacing window tunable via IDENTIFY_PACING_SECONDS.
- Etcd lease TTL now reflects only the identify hold time, so a
  crashed shard frees the bucket faster.

Default config preserves current behavior (one stabilizing shard at a
time, 60s hold). The win comes from raising stabilize concurrency once
the state.DB batcher proves it can absorb more.

See docs/superpowers/specs/2026-04-30-faster-mass-reconnect-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist (bot) left a comment


Code Review

This pull request implements the first phase of the 'Faster Gateway Mass Reconnect' design, aimed at reducing shard re-identification time by decoupling Discord's identify pacing from database stabilization. Key changes include moving the WebSocket handshake outside the identify lock, releasing the lock after a fixed pacing window, and introducing a cross-shard semaphore to manage backfill concurrency. Review feedback identifies critical race conditions in background goroutines where session state and contexts are captured by pointer, which could lead to premature lock releases or failures during session restarts.

Comment thread internal/gatewayws/ws.go
Comment on lines +535 to +558
func (s *Session) scheduleIdentifyLockRelease() {
	pacing := s.identifyPacing
	if pacing <= 0 {
		pacing = IdentifyWaitTime
	}
	hold := pacing + readyGrace
	go func() {
		t := time.NewTimer(hold)
		defer t.Stop()
		select {
		case <-t.C:
		case <-s.ctx.Done():
		}
		if s.identifyMu == nil || s.identifyMu.Key() == "" {
			return
		}
		if s.identifyMu.IsOwner().Result != etcdserverpb.Compare_EQUAL {
			return
		}
		if err := s.releaseIdentifyLock(); err != nil {
			s.log.Error(s.ctx, "release identify lock after pacing", slog.Error(err))
		}
	}()
}

high

The background goroutine in scheduleIdentifyLockRelease captures the Session pointer (s), which leads to a race condition if the session restarts. When Open is called again (e.g., after a connection failure), s.ctx and s.identifyMu are overwritten. The goroutine from the previous attempt will then be waiting on the new session's context and potentially releasing the new session's identify lock prematurely.

Additionally, the comment mentions context.Background to survive unwinding, but the implementation uses s.ctx. If s.ctx is cancelled, s.releaseIdentifyLock() will fail because it also uses the cancelled s.ctx for the etcd Unlock call.

You should capture the context, mutex, and logger locally before starting the goroutine, and use a background context with a timeout for the actual unlock operation.

func (s *Session) scheduleIdentifyLockRelease() {
	pacing := s.identifyPacing
	if pacing <= 0 {
		pacing = IdentifyWaitTime
	}
	hold := pacing + readyGrace
	
	// Capture session state to avoid races if the session restarts.
	ctx, mu, log := s.ctx, s.identifyMu, s.log

	go func() {
		t := time.NewTimer(hold)
		defer t.Stop()
		select {
		case <-t.C:
		case <-ctx.Done():
		}
		if mu == nil || mu.Key() == "" {
			return
		}
		if mu.IsOwner().Result != etcdserverpb.Compare_EQUAL {
			return
		}

		// Use a background context for the unlock to ensure it completes
		// even if the session context has been cancelled.
		unlockCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()

		log.Info(ctx, "release identify lock after pacing", slog.F("key", mu.Key()))
		if err := mu.Unlock(unlockCtx); err != nil {
			log.Error(ctx, "release identify lock after pacing", slog.Error(err))
		}
	}()
}

Comment thread internal/gatewayws/ws.go
Comment on lines +564 to +600
func (s *Session) holdStabilize() {
	if s.stabilizeSem == nil || s.stabilizeDuration <= 0 {
		return
	}

	max := s.stabilizeMaxDuration
	if max <= 0 {
		max = 2 * s.stabilizeDuration
	}

	acquireStart := time.Now()
	select {
	case s.stabilizeSem <- struct{}{}:
	case <-s.ctx.Done():
		return
	}
	acquired := time.Now()
	s.log.Info(s.ctx, "stabilize semaphore acquired",
		slog.F("wait", acquired.Sub(acquireStart).String()))

	defer func() {
		select {
		case <-s.stabilizeSem:
		default:
		}
		s.log.Info(s.ctx, "stabilize semaphore released",
			slog.F("held", time.Since(acquired).String()))
	}()

	hold := min(s.stabilizeDuration, max)
	t := time.NewTimer(hold)
	defer t.Stop()
	select {
	case <-t.C:
	case <-s.ctx.Done():
	}
}

high

Similar to scheduleIdentifyLockRelease, holdStabilize captures the Session pointer and accesses s.ctx inside a background goroutine. If the session restarts and s.ctx is reassigned, the goroutine will wait on the wrong context.

Capture s.ctx locally at the start of the function to ensure the goroutine remains bound to the session attempt that started it.

func (s *Session) holdStabilize() {
	if s.stabilizeSem == nil || s.stabilizeDuration <= 0 {
		return
	}

	ctx, log := s.ctx, s.log
	max := s.stabilizeMaxDuration
	if max <= 0 {
		max = 2 * s.stabilizeDuration
	}

	acquireStart := time.Now()
	select {
	case s.stabilizeSem <- struct{}{}:
	case <-ctx.Done():
		return
	}
	acquired := time.Now()
	log.Info(ctx, "stabilize semaphore acquired",
		slog.F("wait", acquired.Sub(acquireStart).String()))

	defer func() {
		select {
		case <-s.stabilizeSem:
		default:
		}
		log.Info(ctx, "stabilize semaphore released",
			slog.F("held", time.Since(acquired).String()))
	}()

	hold := min(s.stabilizeDuration, max)
	t := time.NewTimer(hold)
	defer t.Stop()
	select {
	case <-t.C:
	case <-ctx.Done():
	}
}

bensonc and others added 3 commits April 30, 2026 15:45
Recompute the mass-identify math: 1024 / 16 buckets = 64 shards per
bucket. Today ~75 min worst case (was stated 50). Phase 1 default ~64
min; raising stabilize concurrency 4x → ~16 min, 8x → ~8 min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sweep

Phase 3 redesigned to avoid per-guild metadata at Tatsu's scale.

- Phase 3a: divergence check at GUILD_CREATE only (cached_count vs
  Discord member_count, default 5% threshold). No new schema.
- Phase 3b: bounded-rate cursor sweep that pages guilds via PK range
  scan, never loading the full guild set into memory. One cursor row
  per process, no per-guild timestamps, no new indexes.
- Drift bound is now a function of sweep rate (default 12 req/s →
  ~24h full sweep over 1M guilds), not per-row freshness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3a (divergence at GUILD_CREATE):
- handler.GuildCreate now reads cached member count once and returns it
  alongside Discord's reported count via EventPayload.
- gatewayws.shouldRequestMembers compares the two; skips Request Guild
  Members when divergence is below BACKFILL_DIVERGENCE_RATIO (default
  0.05). Cold guilds (cached_count == 0) always request.
- Replaces the unconditional per-GUILD_CREATE backfill that drove the
  60s stabilize wait in mass-reconnect.

Phase 3b (bounded-rate cursor sweep):
- New state.DB method GetGuildIDsAfter(after, limit) for paged guild
  enumeration; uses the guilds PK index in psql and a SelectorRange
  in fdb. Never loads more than `limit` ids into memory.
- internal/manager/sweep.go runs one goroutine per process. Pages
  guilds, dispatches Request Guild Members to the owning shard at
  BACKFILL_SWEEP_REQUESTS_PER_SECOND (default 12). Skips guilds owned
  by other processes' shard ranges.
- Cursor persisted to etcd at /gateway/sweep_cursor/<name>; survives
  process restarts. No schema migration required.
- Disable via BACKFILL_SWEEP_ENABLED=false. Tunables:
  BACKFILL_SWEEP_BATCH (default 200), BACKFILL_SWEEP_REQUESTS_PER_SECOND.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bensonc changed the title from "gateway: shrink identify lock + decouple stabilize gate (Phase 1)" to "gateway: faster mass reconnect (Phases 1 + 3)" on Apr 30, 2026
bensonc and others added 2 commits April 30, 2026 16:52
Adds state.DB.CountUnstableShards(name, start, stop) — counts shards
whose persisted curState is not in the steady-state inner loop
('read message' / 'push event to redis'). Same query Tatsu already
uses ad-hoc to spot stuck shards.

Manager.logHealth now logs this count every 5 min as 'shard stability
unstable=N total=M'. With the Phase 1 + 3 changes deployed, the
count should drop to ~0 much faster after a mass reconnect — the
clearest signal that the lock-shrink is doing what it claims.

FoundationDB backend stub-panics like other shards methods; the
manager recovers and skips the log on FDB-backed deploys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-in

Per-guild COUNT(*) on members costs seconds for medium-to-large
guilds in production, even with the index-only scan plan, due to
visibility-map staleness on the hot members table. At 1M guilds in a
~10min mass reconnect that's ~1.7K queries/sec — Postgres falls over.

Move the cached-count fetch out of handler.GuildCreate's hot path
(restored to the original "only when Discord didn't supply a count"
behavior) and into a lazy fetch in gatewayws.shouldRequestMembersForGuild
that only runs when the divergence check is enabled.

Default BACKFILL_DIVERGENCE_RATIO to 0 (disabled). Ops opt in once a
maintained guilds.member_count column or equivalent makes the read O(1).
With divergence disabled, behavior matches pre-Phase-3: every
GUILD_CREATE triggers Request Guild Members. Phase 1 lock-shrink and
Phase 3b sweep are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>