Conversation
Spec covering 3 phases: lock-shrink + decoupled stabilize gate, drain-based stabilize signal, and conditional/swept member backfill. Targets ~50min mass identify down to Discord-floor (~7.5min). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of the faster mass-reconnect design.

- Move WS dial + HELLO outside the identify mutex; Discord pacing only applies to IDENTIFY itself, so the dial latency no longer serializes the bucket.
- Reduce identify-mutex hold time from ~70s to ~15s (pacing + grace). The lock is released right after pacing elapses regardless of READY arrival, so a slow READY no longer wedges the bucket.
- Move the 60s stabilize wait off the identify mutex onto a separate in-process semaphore (IDENTIFY_STABILIZE_CONCURRENCY, default 1). Per-shard hold time is tunable via IDENTIFY_STABILIZE_SECONDS, capped by IDENTIFY_STABILIZE_SECONDS_MAX. Skipped for shards that won't request guild members.
- Identify pacing window tunable via IDENTIFY_PACING_SECONDS.
- Etcd lease TTL now reflects only the identify hold time, so a crashed shard frees the bucket faster.

Default config preserves current behavior (one stabilizing shard at a time, 60s hold). The win comes from raising stabilize concurrency once the state.DB batcher proves it can absorb more.

See docs/superpowers/specs/2026-04-30-faster-mass-reconnect-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request implements the first phase of the 'Faster Gateway Mass Reconnect' design, aimed at reducing shard re-identification time by decoupling Discord's identify pacing from database stabilization. Key changes include moving the WebSocket handshake outside the identify lock, releasing the lock after a fixed pacing window, and introducing a cross-shard semaphore to manage backfill concurrency. Review feedback identifies critical race conditions in background goroutines where session state and contexts are captured by pointer, which could lead to premature lock releases or failures during session restarts.
func (s *Session) scheduleIdentifyLockRelease() {
	pacing := s.identifyPacing
	if pacing <= 0 {
		pacing = IdentifyWaitTime
	}
	hold := pacing + readyGrace
	go func() {
		t := time.NewTimer(hold)
		defer t.Stop()
		select {
		case <-t.C:
		case <-s.ctx.Done():
		}
		if s.identifyMu == nil || s.identifyMu.Key() == "" {
			return
		}
		if s.identifyMu.IsOwner().Result != etcdserverpb.Compare_EQUAL {
			return
		}
		if err := s.releaseIdentifyLock(); err != nil {
			s.log.Error(s.ctx, "release identify lock after pacing", slog.Error(err))
		}
	}()
}
The background goroutine in scheduleIdentifyLockRelease captures the Session pointer (s), which leads to a race condition if the session restarts. When Open is called again (e.g., after a connection failure), s.ctx and s.identifyMu are overwritten. The goroutine from the previous attempt will then be waiting on the new session's context and potentially releasing the new session's identify lock prematurely.
Additionally, the comment mentions context.Background to survive unwinding, but the implementation uses s.ctx. If s.ctx is cancelled, s.releaseIdentifyLock() will fail because it also uses the cancelled s.ctx for the etcd Unlock call.
You should capture the context, mutex, and logger locally before starting the goroutine, and use a background context with a timeout for the actual unlock operation.
func (s *Session) scheduleIdentifyLockRelease() {
pacing := s.identifyPacing
if pacing <= 0 {
pacing = IdentifyWaitTime
}
hold := pacing + readyGrace
// Capture session state to avoid races if the session restarts.
ctx, mu, log := s.ctx, s.identifyMu, s.log
go func() {
t := time.NewTimer(hold)
defer t.Stop()
select {
case <-t.C:
case <-ctx.Done():
}
if mu == nil || mu.Key() == "" {
return
}
if mu.IsOwner().Result != etcdserverpb.Compare_EQUAL {
return
}
// Use a background context for the unlock to ensure it completes
// even if the session context has been cancelled.
unlockCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
log.Info(ctx, "release identify lock after pacing", slog.F("key", mu.Key()))
if err := mu.Unlock(unlockCtx); err != nil {
log.Error(ctx, "release identify lock after pacing", slog.Error(err))
}
}()
}

func (s *Session) holdStabilize() {
	if s.stabilizeSem == nil || s.stabilizeDuration <= 0 {
		return
	}

	max := s.stabilizeMaxDuration
	if max <= 0 {
		max = 2 * s.stabilizeDuration
	}

	acquireStart := time.Now()
	select {
	case s.stabilizeSem <- struct{}{}:
	case <-s.ctx.Done():
		return
	}
	acquired := time.Now()
	s.log.Info(s.ctx, "stabilize semaphore acquired",
		slog.F("wait", acquired.Sub(acquireStart).String()))

	defer func() {
		select {
		case <-s.stabilizeSem:
		default:
		}
		s.log.Info(s.ctx, "stabilize semaphore released",
			slog.F("held", time.Since(acquired).String()))
	}()

	hold := min(s.stabilizeDuration, max)
	t := time.NewTimer(hold)
	defer t.Stop()
	select {
	case <-t.C:
	case <-s.ctx.Done():
	}
}
Similar to scheduleIdentifyLockRelease, holdStabilize captures the Session pointer and accesses s.ctx inside a background goroutine. If the session restarts and s.ctx is reassigned, the goroutine will wait on the wrong context.
Capture s.ctx locally at the start of the function to ensure the goroutine remains bound to the session attempt that started it.
func (s *Session) holdStabilize() {
if s.stabilizeSem == nil || s.stabilizeDuration <= 0 {
return
}
ctx, log := s.ctx, s.log
max := s.stabilizeMaxDuration
if max <= 0 {
max = 2 * s.stabilizeDuration
}
acquireStart := time.Now()
select {
case s.stabilizeSem <- struct{}{}:
case <-ctx.Done():
return
}
acquired := time.Now()
log.Info(ctx, "stabilize semaphore acquired",
slog.F("wait", acquired.Sub(acquireStart).String()))
defer func() {
select {
case <-s.stabilizeSem:
default:
}
log.Info(ctx, "stabilize semaphore released",
slog.F("held", time.Since(acquired).String()))
}()
hold := min(s.stabilizeDuration, max)
t := time.NewTimer(hold)
defer t.Stop()
select {
case <-t.C:
case <-ctx.Done():
}
}

Recompute the mass-identify math: 1024 / 16 buckets = 64 shards per bucket. Today ~75 min worst case (was stated 50). Phase 1 default ~64 min; raising stabilize concurrency 4x → ~16 min, 8x → ~8 min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sweep

Phase 3 redesigned to avoid per-guild metadata at Tatsu's scale.

- Phase 3a: divergence check at GUILD_CREATE only (cached_count vs Discord member_count, default 5% threshold). No new schema.
- Phase 3b: bounded-rate cursor sweep that pages guilds via PK range scan, never loading the full guild set into memory. One cursor row per process, no per-guild timestamps, no new indexes.
- Drift bound is now a function of sweep rate (default 12 req/s → ~24h full sweep over 1M guilds), not per-row freshness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3a (divergence at GUILD_CREATE):

- handler.GuildCreate now reads cached member count once and returns it alongside Discord's reported count via EventPayload.
- gatewayws.shouldRequestMembers compares the two; skips Request Guild Members when divergence is below BACKFILL_DIVERGENCE_RATIO (default 0.05). Cold guilds (cached_count == 0) always request.
- Replaces the unconditional per-GUILD_CREATE backfill that drove the 60s stabilize wait in mass-reconnect.

Phase 3b (bounded-rate cursor sweep):

- New state.DB method GetGuildIDsAfter(after, limit) for paged guild enumeration; uses the guilds PK index in psql and a SelectorRange in fdb. Never loads more than `limit` ids into memory.
- internal/manager/sweep.go runs one goroutine per process. Pages guilds, dispatches Request Guild Members to the owning shard at BACKFILL_SWEEP_REQUESTS_PER_SECOND (default 12). Skips guilds owned by other processes' shard ranges.
- Cursor persisted to etcd at /gateway/sweep_cursor/<name>; survives process restarts. No schema migration required.
- Disable via BACKFILL_SWEEP_ENABLED=false. Tunables: BACKFILL_SWEEP_BATCH (default 200), BACKFILL_SWEEP_REQUESTS_PER_SECOND.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds state.DB.CountUnstableShards(name, start, stop) — counts shards
whose persisted curState is not in the steady-state inner loop
('read message' / 'push event to redis'). Same query Tatsu already
uses ad-hoc to spot stuck shards.
Manager.logHealth now logs this count every 5 min as 'shard stability
unstable=N total=M'. With the Phase 1 + 3 changes deployed, the
count should drop to ~0 much faster after a mass reconnect — the
clearest signal that the lock-shrink is doing what it claims.
The FoundationDB backend stub-panics like the other shard methods; the
manager recovers and skips the log on FDB-backed deploys.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-in

Per-guild COUNT(*) on members costs seconds for medium-to-large guilds in production, even with the index-only scan plan, due to visibility-map staleness on the hot members table. At 1M guilds in a ~10min mass reconnect that's ~1.7K queries/sec — Postgres falls over.

Move the cached-count fetch out of handler.GuildCreate's hot path (restored to the original "only when Discord didn't supply a count" behavior) and into a lazy fetch in gatewayws.shouldRequestMembersForGuild that only runs when the divergence check is enabled.

Default BACKFILL_DIVERGENCE_RATIO to 0 (disabled). Ops opt in once a maintained guilds.member_count column or equivalent makes the read O(1). With divergence disabled, behavior matches pre-Phase-3: every GUILD_CREATE triggers Request Guild Members. Phase 1 lock-shrink and Phase 3b sweep are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Implements Phases 1 and 3 of the faster mass-reconnect design. Phase 2 (drain-based stabilize signal) is deferred — the divergence + sweep work in Phase 3 attacks the same bottleneck more directly.
Mass identify-storm reconnect of ~1024 shards goes from ~75 min today to Discord's identify-throughput floor (~5–10 min) in steady state, once the DB has been populated and most guilds skip backfill via the divergence check. Until that steady state is reached, the structural changes from Phase 1 still let ops dial up identify-stabilize concurrency to land in the 8–16 min range with default config.
Phase 1 — lock-shrink + decoupled stabilize gate
- IDENTIFY_PACING_SECONDS (default 10s) + 5s grace, regardless of READY arrival. Slow READYs no longer wedge the bucket.
- IDENTIFY_STABILIZE_CONCURRENCY (default 1). Hold = IDENTIFY_STABILIZE_SECONDS (default 60). Skipped when shouldProcessMembers() is false.

Phase 3a — divergence check at GUILD_CREATE
- handler.GuildCreate reads cached member count once via existing GetGuildMemberCount and returns it alongside Discord's member_count in EventPayload.
- gatewayws.shouldRequestMembers skips Request Guild Members when |discord - cached| / discord <= BACKFILL_DIVERGENCE_RATIO (default 0.05). Cold guilds (cached == 0) always request.

Phase 3b — bounded-rate cursor sweep
- state.DB.GetGuildIDsAfter(after, limit) paged guild enumeration. Postgres uses the existing PK index; FoundationDB uses a SelectorRange from FirstGreaterThan(guild_key(after)). Never loads more than limit ids into memory.
- internal/manager/sweep.go runs one goroutine per process. Pages guilds, dispatches Request Guild Members to the owning shard for ids within this process's shard range, throttled by golang.org/x/time/rate.
- Cursor persisted to etcd at /gateway/sweep_cursor/<name> — survives restarts, no schema migration.

New env vars (all optional, defaults preserve current behavior)
Out of scope
Test plan
- identify lock acquired wait=...
- stabilize semaphore acquired / stabilize semaphore released held=...
- sweep batch dispatched ...

🤖 Generated with Claude Code