
Stop the WSConn :not_connected retry storm in NodeProxy #11

Merged: dominicletz merged 2 commits into main from cursor/fix-rpc-not-connected-loop-52e3 on Apr 21, 2026
Conversation

@dominicletz (Member) commented on Apr 21, 2026

Fixes two related symptoms with the same root cause: a WSConn whose async TCP/TLS/WS handshake hasn't completed yet returns {:error, :not_connected} from RemoteChain.WSConn.send_request/3, and NodeProxy reacts to that as if the connection were dead, perpetuating the failure instead of letting the in-flight handshake finish.

Symptoms

Symptom 1 — re-spawn storm

08:12:10.014 [warning] Failed to send request to #PID<0.13670260.0>: "...eth_getBlockByNumber...": {:error, :not_connected}
08:12:10.015 [error] Process #PID<0.13669863.0> on node :diode_node@localhost raised an exception
** (RuntimeError) RPC error in Chains.Moonbeam.eth_getBlockByNumber([15307943, false]): {:error, :disconnect}
    (diode 0.0.0) lib/remote_chain/rpc_cache.ex:311: RemoteChain.RPCCache.rpc!/3

Symptom 2 — request-routing churn

00:15:45.880 [warning] Failed to send request to #PID<0.109448870.0>: "...eth_getBlockByNumber...": {:error, :not_connected}
00:15:45.881 [info]    Awaiting undefined key: {RemoteChain.WSConn, #PID<0.109449407.0>}
00:15:45.884 [error]   ** (RuntimeError) RPC error in Chains.Moonbeam.eth_getBlockByNumber([15307932, false]): {:error, :disconnect}
                       (diode 0.0.0) lib/remote_chain/rpc_cache.ex:311: RemoteChain.RPCCache.rpc!/3

Note the two distinct WSConn pids in two consecutive log lines: the first WSConn's send_request/3 timed out, then RPCCache.rpc/3's retry picked a second, still-handshaking WSConn from the pool and timed out again.

In both cases the messages keep repeating without healing.

Root cause

RemoteChain.WSConn.send_request/3 returns {:error, :not_connected} in two distinct situations:

  1. The WSConn process is dead (crashed, exited before handle_connect).
  2. The WSConn process is still alive but its async handshake (TCP + TLS + WebSocket upgrade) has not completed within the 500 ms Globals.await({WSConn, pid}, 500) budget. This is normal during the first few hundred ms after WSConn.start/3, or any time the upstream RPC node is slow to accept the connection.

Two pieces of NodeProxy conflated case (2) with case (1):

  • NodeProxy.send_request/6 treated every failed send as a dead connection: replied {:error, :disconnect}, evicted the conn, and asked ensure_connections to spawn a fresh WSConn. The fresh WSConn was also mid-handshake → next request hits the same path → loop never heals.

  • NodeProxy.handle_call({:rpc, ...}) picked a connection blindly with Enum.random(Map.values(state.connections)), with no readiness filter. When the pool contained any still-handshaking conns (right after start-up, after a fallback was added, after a slow upstream rejected the previous conn) every random pick paid the full 500 ms Globals.await budget and failed.
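Roughly, the pre-fix code paths those two bullets describe (a paraphrased sketch, not the literal diode source; the module and function names below are placeholders):

# Paraphrased sketch of the pre-fix behaviour; NodeProxySketch.Before
# and its function names are placeholders, not the actual diode code.
defmodule NodeProxySketch.Before do
  # What handle_call({:rpc, ...}) effectively did: no readiness filter,
  # so a still-handshaking WSConn is as likely to be picked as a ready
  # one, and every such pick burns the full 500 ms Globals.await budget
  # inside WSConn.send_request/3 before failing.
  def pick(state), do: Enum.random(Map.values(state.connections))

  # What send_request/6 effectively did on any failed send: treat the
  # conn as dead, evict it, and respawn. A replacement spawned mid-storm
  # is itself still handshaking, so the next request hits the same path
  # and the loop never heals.
  def on_failed_send(state, pid) do
    state
    |> Map.update!(:connections, fn conns ->
      Map.reject(conns, fn {_url, conn} -> conn == pid end)
    end)
    |> ensure_connections()
  end

  # Placeholder for the real respawn logic.
  defp ensure_connections(state), do: state
end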

Fix

Two complementary changes, both in this PR, target the same root cause:

  1. NodeProxy.handle_failed_send/2 (commit 1) — only evict (and re-arm ensure_connections) when the WSConn pid is actually dead, i.e. Process.alive?/1 returns false. If it's still alive, leave it in the pool: either its handshake completes on its own and subsequent requests succeed, or its own ping watchdog (WSConn.handle_info(:ping, _)) tears it down and the existing :DOWN monitor in NodeProxy cleans the pool up.

  2. NodeProxy.pick_connection/1 (commit 2) — prefer WSConns whose handshake has already completed. Falls back to the full pool only when none are ready, preserving the empty-pool behaviour so callers still see a well-defined {:error, :disconnect} rather than a silent hang.

    This relies on a new helper, RemoteChain.WSConn.ready?/1, a non-blocking, side-effect-free check based on Globals.get({WSConn, pid}). Unlike WSConn.send_request/3 it does not register a waiter in Globals, so it produces no "Awaiting undefined key" or zombie-timeout log noise even when called for many handshaking pids.

Both decisions are extracted to public helpers (@doc false) so they can be unit-tested without standing up real websockets.
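A condensed sketch of the three helpers, based on the description above and assuming Globals.get/1 returns nil for a key that was never registered; the state shape, the eviction details, and the ensure_connections/1 placeholder are illustrative, not the literal diode source:

defmodule RemoteChain.WSConn do
  @doc false
  # Non-blocking readiness probe: handle_connect/2 registers the conn
  # under {__MODULE__, pid}, so a nil lookup means "still handshaking".
  # Unlike Globals.await/2 (used by send_request/3), Globals.get/1 never
  # registers a waiter, so probing many handshaking pids is silent.
  def ready?(pid), do: Globals.get({__MODULE__, pid}) != nil
end

defmodule RemoteChain.NodeProxy do
  @doc false
  # Prefer conns whose handshake is already done. Falling back to the
  # full pool keeps the old empty-pool behaviour (Enum.random/1 raises
  # on []), so callers still see a well-defined {:error, :disconnect}
  # rather than a silent hang.
  def pick_connection(state) do
    conns = Map.values(state.connections)

    case Enum.filter(conns, &RemoteChain.WSConn.ready?/1) do
      [] -> Enum.random(conns)
      ready -> Enum.random(ready)
    end
  end

  @doc false
  # Only evict (and re-arm ensure_connections) when the pid is really
  # dead; an alive pid is still mid-handshake and is left alone. The
  # real helper also replies {:error, :disconnect} to in-flight callers
  # and clears the fallback/fallback_url fields, elided here.
  def handle_failed_send(state, pid) do
    if Process.alive?(pid) do
      state
    else
      state
      |> Map.update!(:connections, fn conns ->
        Map.reject(conns, fn {_url, conn} -> conn == pid end)
      end)
      |> ensure_connections()
    end
  end

  # Placeholder for the real respawn logic.
  defp ensure_connections(state), do: state
end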

Tests

test/remote_chain/node_proxy_test.exs covers seven cases across three groups:

describe "handle_failed_send/2":

  • alive (still handshaking) WSConn — must be kept in the pool (regression test for symptom 1)
  • dead WSConn — must be removed and its in-flight callers replied to with {:error, :disconnect}
  • dead fallback WSConn — must clear both fallback and fallback_url

describe "pick_connection/1":

  • prefers ready WSConns over still-handshaking ones (regression test for symptom 2; runs 200 picks and asserts the handshaking pid is never chosen while a ready peer exists)
  • falls back to the full pool when no WSConn is ready yet

describe "WSConn.ready?/1":

  • returns false for an alive but non-registered (still handshaking) pid, with a refute_receive proving the call does not register a Globals waiter
  • returns true once handle_connect/2 has registered the conn

Verified all four "new" tests fail on a tree with the test file but without the lib changes (UndefinedFunctionError for WSConn.ready?/1 / NodeProxy.handle_failed_send/2), and pass with the fix:

$ DIODE_MINIMAL_TEST=1 MIX_ENV=test mix test --no-start test/remote_chain/node_proxy_test.exs
7 tests, 0 failures
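For flavour, a trimmed sketch of the all-handshaking fallback case; the NodeProxy state is approximated here as a map with a :connections field (url => pid), which may differ from how the real test file builds it:

# Trimmed sketch of the fallback test. The state shape and the
# standalone spawn are illustrative assumptions, not the real test.
defmodule RemoteChain.NodeProxyTest do
  use ExUnit.Case, async: true

  test "falls back to the full pool when no WSConn is ready yet" do
    # A pid that never registers itself under {RemoteChain.WSConn, pid}
    # stands in for a WSConn whose handshake has not completed.
    handshaking = spawn(fn -> Process.sleep(:infinity) end)

    refute RemoteChain.WSConn.ready?(handshaking)

    # With no ready conn available, pick_connection/1 still returns a
    # conn from the full pool rather than hanging.
    state = %{connections: %{"wss://example" => handshaking}}
    assert RemoteChain.NodeProxy.pick_connection(state) == handshaking
  end
end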

Commit 1 message:

WSConn.send_request/3 returns {:error, :not_connected} both when the
underlying WSConn process is dead and when it is still alive but its
async TCP/TLS/WS handshake has not finished within the 500 ms
Globals.await budget.

Since commit ae7d7a7 NodeProxy.send_request/6 evicted the conn and
asked ensure_connections to spawn a fresh one on every failure. Under
the 'still handshaking' case that fresh WSConn was also mid-handshake,
so the next request hit the same path and the warning

  Failed to send request to #PID<...>: ... {:error, :not_connected}

followed by

  ** (RuntimeError) RPC error in ChainImpl.eth_getBlockByNumber(...):
     {:error, :disconnect}

repeated forever instead of healing once the in-flight handshake
completed.

Only evict (and trigger ensure_connections) when the WSConn pid is
actually dead. If it is still alive, leave it in the pool. Either its
handshake finishes on its own, or its own ping watchdog tears it down
and the existing :DOWN monitor in NodeProxy cleans up.

Adds a regression test that covers all three branches of the new
handle_failed_send/2 helper.

Co-authored-by: Dominic Letz <dominicletz@users.noreply.github.com>

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a more resilient connection management strategy in RemoteChain.NodeProxy by preventing the eager eviction of connections that are still alive but mid-handshake. By checking Process.alive?/1 before removal, the system avoids a 'retry storm' of constant connection respawning. Comprehensive regression tests were added to verify this behavior for both standard and fallback connections. I have no feedback to provide.

Commit 2 message:

Same root cause as the previous commit -- WSConn.send_request/3
returning {:error, :not_connected} for an alive but not-yet-connected
WSConn -- but a complementary symptom:

  00:15:45.880 [warning] Failed to send request to #PID<0.109448870.0>: ... {:error, :not_connected}
  00:15:45.881 [info] Awaiting undefined key: {RemoteChain.WSConn, #PID<0.109449407.0>}
  ** (RuntimeError) RPC error in Chains.Moonbeam.eth_getBlockByNumber(...): {:error, :disconnect}

The two log lines reference *different* WSConn pids 1 ms apart: the
first call to NodeProxy.handle_call({:rpc, ...}) randomly picked one
mid-handshake conn and timed out, then RPCCache.rpc/3's retry randomly
picked another mid-handshake conn and timed out again.

NodeProxy.handle_call({:rpc, ...}) used to do
Enum.random(Map.values(state.connections)) without any readiness
filter. Whenever the pool contained one or more handshaking WSConns
(right after start-up, after a fallback was added, after a slow
upstream rejected the previous conn) every random pick paid the full
500 ms Globals.await budget and returned :not_connected, and the
warning kept repeating instead of healing.

Fix:

  * WSConn.ready?/1 -- non-blocking, side-effect-free readiness check
    based on Globals.get({WSConn, pid}). Unlike send_request/3 it
    never registers a Globals waiter, so it produces no
    'Awaiting undefined key' / zombie-timeout log noise.

  * NodeProxy.pick_connection/1 -- prefers ready WSConns; only falls
    back to the full pool when none are ready (preserving the
    empty-pool crash so callers still see a well-defined
    :disconnect rather than a silent hang).

Adds three regression tests covering pick_connection/1's preference
logic, its all-handshaking fallback, and WSConn.ready?/1 itself
(including a refute_receive proving it does not register a waiter).

Co-authored-by: Dominic Letz <dominicletz@users.noreply.github.com>
@cursor (Bot) changed the title from "Stop evicting still-handshaking WSConn from NodeProxy pool" to "Stop the WSConn :not_connected retry storm in NodeProxy" on Apr 21, 2026
dominicletz marked this pull request as ready for review on April 21, 2026 at 10:32
dominicletz merged commit 84e6d9a into main on Apr 21, 2026
2 checks passed
dominicletz deleted the cursor/fix-rpc-not-connected-loop-52e3 branch on April 21, 2026 at 10:32