
fix: bound y-websocket reconnect rate on transient post-open closes #954

Closed
kptdobe wants to merge 1 commit into main from cor-44/rapid-reconnect-guard



kptdobe commented May 21, 2026

Summary

  • y-websocket resets its wsUnsuccessfulReconnects counter on every successful onopen, so any close that follows a brief successful handshake is rescheduled at ~100 ms. The 4401/4403 guard from #943 (fix: block y-websocket auto-reconnect during async IMS refresh) does not cover non-auth paths (1011 from initSession catch, 1005 from closeConn, 1006 from socket reset). As a result, single users behind corporate Zscaler proxies have sustained 5k+ WS upgrades/sec/IP (see COR-43).
  • Add a rapid-reconnect guard to createConnection: track open-then-close intervals on the provider and apply manual exponential backoff (1s/2s/4s/… capped at SHORT_SESSION_MAX_MS = 30 s) when sessions shorter than MIN_HEALTHY_SESSION_MS = 5 s repeat. A healthy session (≥ 5 s lived) resets the counter so a routine reconnect after a long session still reconnects via y-websocket's own timer.
  • The auth path from #943 (fix: block y-websocket auto-reconnect during async IMS refresh) is unchanged, and the new guard never increments on 4401/4403 closes: the auth flow has its own loop guard via lastSentToken.
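
For readers who want the backoff arithmetic spelled out, here is a minimal sketch of the guard described above. It is not the code from this PR. MIN_HEALTHY_SESSION_MS, SHORT_SESSION_MAX_MS and the 4401/4403 exclusion come from the summary; SHORT_SESSION_BASE_MS, createRapidReconnectGuard, the injected clock, and the choice to start backing off on the first short session are assumptions made for illustration.

```javascript
// Values stated in the PR summary.
const MIN_HEALTHY_SESSION_MS = 5000;
const SHORT_SESSION_MAX_MS = 30000;
// Assumption: base delay inferred from the 1s/2s/4s sequence.
const SHORT_SESSION_BASE_MS = 1000;
// Close codes owned by the auth flow from #943; never counted by this guard.
const AUTH_CLOSE_CODES = new Set([4401, 4403]);

// Hypothetical helper: tracks how long each session lived between open and close.
// `now` is injectable so the logic can be exercised without real timers.
function createRapidReconnectGuard(now = () => Date.now()) {
  let openedAt = null;
  let shortSessions = 0;

  return {
    onOpen() {
      openedAt = now();
    },
    // Returns the manual backoff in ms, or 0 to leave reconnection to
    // y-websocket's own timer.
    onClose(code) {
      const lived = openedAt === null ? null : now() - openedAt;
      openedAt = null;
      // A close without a prior open, or an auth close, is not this guard's concern.
      if (lived === null || AUTH_CLOSE_CODES.has(code)) return 0;
      if (lived >= MIN_HEALTHY_SESSION_MS) {
        shortSessions = 0; // healthy session: reset the counter
        return 0;
      }
      shortSessions += 1;
      const delay = SHORT_SESSION_BASE_MS * 2 ** (shortSessions - 1);
      return Math.min(delay, SHORT_SESSION_MAX_MS);
    },
  };
}

// Usage with a fake clock: three sessions that each live 200 ms.
let fakeTime = 0;
const guard = createRapidReconnectGuard(() => fakeTime);
const observedDelays = [];
for (let i = 0; i < 3; i += 1) {
  guard.onOpen();
  fakeTime += 200;
  observedDelays.push(guard.onClose(1006));
}
console.log(observedDelays);
```

Keeping the counter separate from wsUnsuccessfulReconnects is the point of the design: the upstream counter is cleared by every onopen, whereas this one is cleared only by a session that actually stayed up.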

Implementation notes

  • The new constants live as module-private consts in blocks/edit/prose/index.js.
  • The guard sits BEFORE the existing token-refresh provider.protocols reassignment so a token refresh still happens for the next manual provider.connect() after the backoff.
  • y-websocket's internal 100 ms setTimeout(setupWS) still fires from onclose, but provider.disconnect() flips shouldConnect = false, so that timer's setupWS call no-ops. The manual setTimeout is what re-arms the connection.
  • No changes to da-y-wrapper or upstream y-websocket — the fix sits at the consuming-call site so it ships with da-live's normal release cadence.
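
The ordering in these notes can be sketched with a stand-in provider. This is one reading of the notes, not the PR's code: createFakeProvider, onConnectionClose, the injected schedule function, and the protocol values are all hypothetical, and the stub models only shouldConnect, connect(), disconnect() and a setupWS that no-ops while shouldConnect is false.

```javascript
// Hypothetical stand-in for the y-websocket provider; only the pieces the
// notes mention are modelled.
function createFakeProvider() {
  const provider = {
    shouldConnect: true,
    protocols: ['yjs', 'old-token'], // illustrative values only
    setupCalls: 0,
    // Mirrors the behaviour described above: setupWS no-ops while parked.
    setupWS() {
      if (provider.shouldConnect) provider.setupCalls += 1;
    },
    disconnect() {
      provider.shouldConnect = false;
    },
    connect() {
      provider.shouldConnect = true;
      provider.setupWS();
    },
  };
  return provider;
}

// Hypothetical close handler. `schedule` defaults to setTimeout and is
// injectable so the flow can be checked without waiting.
function onConnectionClose(provider, backoffMs, freshProtocols, schedule = setTimeout) {
  if (backoffMs > 0) {
    // Park the provider: the upstream 100 ms timer will find shouldConnect === false.
    provider.disconnect();
    // The manual timer is what re-arms the connection.
    schedule(() => provider.connect(), backoffMs);
  }
  // The guard runs before this existing step, so the later manual connect()
  // still picks up refreshed protocols.
  provider.protocols = freshProtocols;
}
```

With this shape, the upstream timer firing during the backoff window calls setupWS with shouldConnect still false, so nothing happens until the manual timer calls connect().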

Test plan

  • npm run lint clean on the touched files
  • Focused unit tests pass: npx wtr "./test/unit/blocks/edit/prose/index.test.js" — 35/35 green, including the new prose/index createConnection rapid-reconnect guard (COR-44) suite (7 new scenarios) plus an updated existing test that now reflects the post-guard shouldConnect = false parking state.
  • After merge, query Coralogix and confirm the dominant da-collab URLs from the COR-1 daily review drop out of the >95% exception-ratio bucket and that the populated "Network connection lost." exception count returns to baseline (<10k/24h) within 48 h.

Refs

🤖 Generated with Claude Code

y-websocket resets its backoff counter on every successful onopen, so
any close that follows a brief successful handshake reschedules at
~100ms. The 4401/4403 guard from #943 does not cover non-auth paths
(1011 from initSession catch, 1005/1006 from socket reset), and single
users behind corporate proxies can sustain 5k+ WS upgrades/sec/IP.

Add a rapid-reconnect guard to createConnection: track open/close
intervals on the provider and apply manual exponential backoff
(1s/2s/4s/... capped at 30s) when sessions shorter than 5s repeat. A
healthy session (>= MIN_HEALTHY_SESSION_MS) resets the counter. The
auth path from PR #943 is unchanged.

Refs: COR-44

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Signed-off-by: kptdobe <acapt@adobe.com>

aem-code-sync Bot commented May 21, 2026

Hello, I'm the AEM Code Sync Bot and I will run some actions to deploy your branch.
In case there are problems, just click the checkbox below to rerun the respective action.

  • Re-sync branch

kptdobe commented May 21, 2026

Replaced by #955 — the branch name on this PR violated da-live CLAUDE.md (branches must be max 8 lowercase alphanumeric chars; this is an IMS constraint that breaks CI/preview). New PR opens the same change from branch cor44. Closing this one to avoid confusion.

kptdobe closed this May 21, 2026
kptdobe deleted the cor-44/rapid-reconnect-guard branch May 21, 2026 09:26