
Stop the WSConn :not_connected retry storm in NodeProxy #11

Merged: dominicletz merged 2 commits into main from cursor/fix-rpc-not-connected-loop-52e3 on Apr 21, 2026
Conversation

@dominicletz (Member) commented on Apr 21, 2026

Fixes two related symptoms with the same root cause: a WSConn whose async TCP/TLS/WS handshake hasn't completed yet returns {:error, :not_connected} from RemoteChain.WSConn.send_request/3, and NodeProxy reacts to that as if the connection were dead, perpetuating the failure instead of letting the in-flight handshake finish.

Symptoms

Symptom 1 — re-spawn storm

08:12:10.014 [warning] Failed to send request to #PID<0.13670260.0>: "...eth_getBlockByNumber...": {:error, :not_connected}
08:12:10.015 [error] Process #PID<0.13669863.0> on node :diode_node@localhost raised an exception
** (RuntimeError) RPC error in Chains.Moonbeam.eth_getBlockByNumber([15307943, false]): {:error, :disconnect}
    (diode 0.0.0) lib/remote_chain/rpc_cache.ex:311: RemoteChain.RPCCache.rpc!/3

Symptom 2 — request-routing churn

00:15:45.880 [warning] Failed to send request to #PID<0.109448870.0>: "...eth_getBlockByNumber...": {:error, :not_connected}
00:15:45.881 [info]    Awaiting undefined key: {RemoteChain.WSConn, #PID<0.109449407.0>}
00:15:45.884 [error]   ** (RuntimeError) RPC error in Chains.Moonbeam.eth_getBlockByNumber([15307932, false]): {:error, :disconnect}
                       (diode 0.0.0) lib/remote_chain/rpc_cache.ex:311: RemoteChain.RPCCache.rpc!/3

Note the two distinct WSConn pids in two consecutive log lines: the first WSConn's send_request/3 timed out, then RPCCache.rpc/3's retry picked a second, still-handshaking WSConn from the pool and timed out again.

In both cases the messages keep repeating without healing.

Root cause

RemoteChain.WSConn.send_request/3 returns {:error, :not_connected} in two distinct situations:

  1. The WSConn process is dead (crashed, exited before handle_connect).
  2. The WSConn process is still alive but its async handshake (TCP + TLS + WebSocket upgrade) has not completed within the 500 ms Globals.await({WSConn, pid}, 500) budget. This is normal during the first few hundred ms after WSConn.start/3, or any time the upstream RPC node is slow to accept the connection.

Two pieces of NodeProxy conflated case (2) with case (1):

  • NodeProxy.send_request/6 treated every failed send as a dead connection: replied {:error, :disconnect}, evicted the conn, and asked ensure_connections to spawn a fresh WSConn. The fresh WSConn was also mid-handshake → next request hits the same path → loop never heals.

  • NodeProxy.handle_call({:rpc, ...}) picked a connection blindly with Enum.random(Map.values(state.connections)), with no readiness filter. When the pool contained any still-handshaking conns (right after start-up, after a fallback was added, after a slow upstream rejected the previous conn) every random pick paid the full 500 ms Globals.await budget and failed.
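Roughly, the pre-fix code paths those two bullets describe (a paraphrased sketch, not the literal diode source; the module and function names below are placeholders):

# Paraphrased sketch of the pre-fix behaviour; NodeProxySketch.Before
# and its function names are placeholders, not the actual diode code.
defmodule NodeProxySketch.Before do
  # What handle_call({:rpc, ...}) effectively did: no readiness filter,
  # so a still-handshaking WSConn is as likely to be picked as a ready
  # one, and every such pick burns the full 500 ms Globals.await budget
  # inside WSConn.send_request/3 before failing.
  def pick(state), do: Enum.random(Map.values(state.connections))

  # What send_request/6 effectively did on any failed send: treat the
  # conn as dead, evict it, and respawn. A replacement spawned mid-storm
  # is itself still handshaking, so the next request hits the same path
  # and the loop never heals.
  def on_failed_send(state, pid) do
    state
    |> Map.update!(:connections, fn conns ->
      Map.reject(conns, fn {_url, conn} -> conn == pid end)
    end)
    |> ensure_connections()
  end

  # Placeholder for the real respawn logic.
  defp ensure_connections(state), do: state
end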

Fix

Two complementary changes, both in this PR, target the same root cause:

  1. NodeProxy.handle_failed_send/2 (commit 1) — only evict (and re-arm ensure_connections) when the WSConn pid is actually dead, i.e. Process.alive?/1 returns false. If it's still alive, leave it in the pool: either its handshake completes on its own and subsequent requests succeed, or its own ping watchdog (WSConn.handle_info(:ping, _)) tears it down and the existing :DOWN monitor in NodeProxy cleans the pool up.

  2. NodeProxy.pick_connection/1 (commit 2) — prefer WSConns whose handshake has already completed. Falls back to the full pool only when none are ready, preserving the empty-pool behaviour so callers still see a well-defined {:error, :disconnect} rather than a silent hang.

    This relies on a new helper, RemoteChain.WSConn.ready?/1, a non-blocking, side-effect-free check based on Globals.get({WSConn, pid}). Unlike WSConn.send_request/3 it does not register a waiter in Globals, so it produces no "Awaiting undefined key" or zombie-timeout log noise even when called for many handshaking pids.

Both decisions are extracted to public helpers (@doc false) so they can be unit-tested without standing up real websockets.
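A condensed sketch of the three helpers, based on the description above and assuming Globals.get/1 returns nil for a key that was never registered; the state shape, the eviction details, and the ensure_connections/1 placeholder are illustrative, not the literal diode source:

defmodule RemoteChain.WSConn do
  @doc false
  # Non-blocking readiness probe: handle_connect/2 registers the conn
  # under {__MODULE__, pid}, so a nil lookup means "still handshaking".
  # Unlike Globals.await/2 (used by send_request/3), Globals.get/1 never
  # registers a waiter, so probing many handshaking pids is silent.
  def ready?(pid), do: Globals.get({__MODULE__, pid}) != nil
end

defmodule RemoteChain.NodeProxy do
  @doc false
  # Prefer conns whose handshake is already done. Falling back to the
  # full pool keeps the old empty-pool behaviour (Enum.random/1 raises
  # on []), so callers still see a well-defined {:error, :disconnect}
  # rather than a silent hang.
  def pick_connection(state) do
    conns = Map.values(state.connections)

    case Enum.filter(conns, &RemoteChain.WSConn.ready?/1) do
      [] -> Enum.random(conns)
      ready -> Enum.random(ready)
    end
  end

  @doc false
  # Only evict (and re-arm ensure_connections) when the pid is really
  # dead; an alive pid is still mid-handshake and is left alone. The
  # real helper also replies {:error, :disconnect} to in-flight callers
  # and clears the fallback/fallback_url fields, elided here.
  def handle_failed_send(state, pid) do
    if Process.alive?(pid) do
      state
    else
      state
      |> Map.update!(:connections, fn conns ->
        Map.reject(conns, fn {_url, conn} -> conn == pid end)
      end)
      |> ensure_connections()
    end
  end

  # Placeholder for the real respawn logic.
  defp ensure_connections(state), do: state
end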

Tests

test/remote_chain/node_proxy_test.exs covers seven cases across three groups:

describe "handle_failed_send/2":

  • alive (still handshaking) WSConn — must be kept in the pool (regression test for symptom 1)
  • dead WSConn — must be removed and its in-flight callers replied to with {:error, :disconnect}
  • dead fallback WSConn — must clear both fallback and fallback_url

describe "pick_connection/1":

  • prefers ready WSConns over still-handshaking ones (regression test for symptom 2; runs 200 picks and asserts the handshaking pid is never chosen while a ready peer exists)
  • falls back to the full pool when no WSConn is ready yet

describe "WSConn.ready?/1":

  • returns false for an alive but non-registered (still handshaking) pid, with a refute_receive proving the call does not register a Globals waiter
  • returns true once handle_connect/2 has registered the conn

Verified all four "new" tests fail on a tree with the test file but without the lib changes (UndefinedFunctionError for WSConn.ready?/1 / NodeProxy.handle_failed_send/2), and pass with the fix:

$ DIODE_MINIMAL_TEST=1 MIX_ENV=test mix test --no-start test/remote_chain/node_proxy_test.exs
7 tests, 0 failures
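For flavour, a trimmed sketch of the all-handshaking fallback case; the NodeProxy state is approximated here as a map with a :connections field (url => pid), which may differ from how the real test file builds it:

# Trimmed sketch of the fallback test. The state shape and the
# standalone spawn are illustrative assumptions, not the real test.
defmodule RemoteChain.NodeProxyTest do
  use ExUnit.Case, async: true

  test "falls back to the full pool when no WSConn is ready yet" do
    # A pid that never registers itself under {RemoteChain.WSConn, pid}
    # stands in for a WSConn whose handshake has not completed.
    handshaking = spawn(fn -> Process.sleep(:infinity) end)

    refute RemoteChain.WSConn.ready?(handshaking)

    # With no ready conn available, pick_connection/1 still returns a
    # conn from the full pool rather than hanging.
    state = %{connections: %{"wss://example" => handshaking}}
    assert RemoteChain.NodeProxy.pick_connection(state) == handshaking
  end
end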

Commit 1 message:

WSConn.send_request/3 returns {:error, :not_connected} both when the
underlying WSConn process is dead and when it is still alive but its
async TCP/TLS/WS handshake has not finished within the 500 ms
Globals.await budget.

Since commit ae7d7a7 NodeProxy.send_request/6 evicted the conn and
asked ensure_connections to spawn a fresh one on every failure. Under
the 'still handshaking' case that fresh WSConn was also mid-handshake,
so the next request hit the same path and the warning

  Failed to send request to #PID<...>: ... {:error, :not_connected}

followed by

  ** (RuntimeError) RPC error in ChainImpl.eth_getBlockByNumber(...):
     {:error, :disconnect}

repeated forever instead of healing once the in-flight handshake
completed.

Only evict (and trigger ensure_connections) when the WSConn pid is
actually dead. If it is still alive, leave it in the pool. Either its
handshake finishes on its own, or its own ping watchdog tears it down
and the existing :DOWN monitor in NodeProxy cleans up.

Adds a regression test that covers all three branches of the new
handle_failed_send/2 helper.

Co-authored-by: Dominic Letz <dominicletz@users.noreply.github.com>

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a more resilient connection management strategy in RemoteChain.NodeProxy by preventing the eager eviction of connections that are still alive but mid-handshake. By checking Process.alive?/1 before removal, the system avoids a 'retry storm' of constant connection respawning. Comprehensive regression tests were added to verify this behavior for both standard and fallback connections. I have no feedback to provide.

Commit 2 message:

Same root cause as the previous commit -- WSConn.send_request/3
returning {:error, :not_connected} for an alive but not-yet-connected
WSConn -- but a complementary symptom:

  00:15:45.880 [warning] Failed to send request to #PID<0.109448870.0>: ... {:error, :not_connected}
  00:15:45.881 [info] Awaiting undefined key: {RemoteChain.WSConn, #PID<0.109449407.0>}
  ** (RuntimeError) RPC error in Chains.Moonbeam.eth_getBlockByNumber(...): {:error, :disconnect}

The two log lines reference *different* WSConn pids 1 ms apart: the
first call to NodeProxy.handle_call({:rpc, ...}) randomly picked one
mid-handshake conn and timed out, then RPCCache.rpc/3's retry randomly
picked another mid-handshake conn and timed out again.

NodeProxy.handle_call({:rpc, ...}) used to do
Enum.random(Map.values(state.connections)) without any readiness
filter. Whenever the pool contained one or more handshaking WSConns
(right after start-up, after a fallback was added, after a slow
upstream rejected the previous conn) every random pick paid the full
500 ms Globals.await budget and returned :not_connected, and the
warning kept repeating instead of healing.

Fix:

  * WSConn.ready?/1 -- non-blocking, side-effect-free readiness check
    based on Globals.get({WSConn, pid}). Unlike send_request/3 it
    never registers a Globals waiter, so it produces no
    'Awaiting undefined key' / zombie-timeout log noise.

  * NodeProxy.pick_connection/1 -- prefers ready WSConns; only falls
    back to the full pool when none are ready (preserving the
    empty-pool crash so callers still see a well-defined
    :disconnect rather than a silent hang).

Adds three regression tests covering pick_connection/1's preference
logic, its all-handshaking fallback, and WSConn.ready?/1 itself
(including a refute_receive proving it does not register a waiter).

Co-authored-by: Dominic Letz <dominicletz@users.noreply.github.com>
@cursor (Bot) changed the title from "Stop evicting still-handshaking WSConn from NodeProxy pool" to "Stop the WSConn :not_connected retry storm in NodeProxy" on Apr 21, 2026
dominicletz marked this pull request as ready for review on April 21, 2026 at 10:32
dominicletz merged commit 84e6d9a into main on Apr 21, 2026
2 checks passed
dominicletz deleted the cursor/fix-rpc-not-connected-loop-52e3 branch on April 21, 2026 at 10:32