Skip to content

fix(cor-572): CLI client timeout vs draining/black-holed core host#1501

Draft
dvydra wants to merge 2 commits into
mainfrom
danielv/cor-572-entire-cli-hangs-against-a-drainingblack-holed-core-host-no
Draft

fix(cor-572): CLI client timeout vs draining/black-holed core host#1501
dvydra wants to merge 2 commits into
mainfrom
danielv/cor-572-entire-cli-hangs-against-a-drainingblack-holed-core-host-no

Conversation

@dvydra

@dvydra dvydra commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

https://entire.io/gh/entireio/cli/trails/644

COR-572 — entire CLI hangs against a draining / black-holed core host

Problem. Control-plane commands build their HTTP client in internal/coreapi/cross_juris_transport.go:newCrossJurisHTTPClient() (reached via both coreapi.New()clientForTarget and coreapi.NewWithBearer). That client sets no overall Timeout. The shared transport already bounds the TCP connect (httpclient.DefaultDialTimeout = 4s, env-overridable via ENTIRE_CONNECT_TIMEOUT_SECONDS), so a host that refuses/drops packets fails fast — but a draining host that accepts the connection and then never sends response headers (exactly the COR-562 cutover scenario) makes client.Do block until the OS TCP timeout. No CLI-level deadline applies.

Fix. Give the control-plane http.Client an overall request timeout (30s) at that single chokepoint. Every caller (New, NewForCluster, NewWithBearer, the ENTIRE_TOKEN bypass) inherits it. A per-command context deadline in corecmd.go was considered but the client-level timeout is the single point that covers all callers, including entire auth status, which builds its own /me client.

Note on the issue's "2s-dial". The repo standardized on a 4s dial timeout (httpclient.DefaultDialTimeout); I did not introduce a conflicting 2s value. Only the missing overall timeout is added.

Files touched

  • internal/coreapi/cross_juris_transport.go — add controlPlaneRequestTimeout const + set Timeout on the client (via a newCrossJurisHTTPClientWithTimeout helper so tests can use a short budget).
  • internal/coreapi/client_timeout_test.go — test: a silent/draining host returns a timeout error promptly instead of blocking.

Closes COR-572


🚧 Draft — NO MERGE / NO DEPLOY. Part of the COR-562 cutover tooling batch; scrapbook: entirehq/entiredb#2121.


Note

Low Risk
Narrow networking guardrail at one client factory; may surface timeouts on legitimately slow control-plane responses over 30s.

Overview
Fixes CLI hangs when the control-plane core accepts TCP but never returns HTTP headers (e.g. draining/black-holed host during cutover). Connect already fails fast via the shared 4s dial budget; the missing piece was an overall request deadline on client.Do.

Adds a 30s http.Client.Timeout at the single chokepoint in newCrossJurisHTTPClient (via a test-friendly helper so tests can use a shorter budget), so every control-plane path—coreapi.New, NewForCluster, NewWithBearer, and ENTIRE_TOKEN—inherits the same cap, including entire auth status’s /me client.

New test in client_timeout_test.go asserts a silent/draining host returns a timeout error instead of blocking indefinitely.

Reviewed by Cursor Bugbot for commit 38a6bc5. Configure here.

Copilot AI review requested due to automatic review settings June 23, 2026 05:33

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review any files in this pull request.

newCrossJurisHTTPClient had no overall Timeout, so a draining/black-holed
core that accepts the connection but never sends response headers made
`entire` commands hang on the OS TCP timeout. Add a 30s overall timeout at
that single chokepoint; every caller (New, NewForCluster, NewWithBearer,
the ENTIRE_TOKEN bypass) inherits it. The dial stays bounded by
httpclient.DefaultDialTimeout.

Trail: https://entire.io/gh/entireio/cli/trails/644
Closes COR-572

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dvydra

dvydra commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Review gauntlet — clean

Ran /review, /security-review, /docs-review, /simplify as inline manual passes (small diff: 2 files, +85 lines; the gstack specialist fan-out dispatches subagents, which this worker context can't spawn).

  • Correctness: the timeout surfaces as a standard net/url timeout error and propagates through the ogen client + existing CLI error handling. http.Client is constructed per call and is safe for concurrent use. No race, no swallowed errors.
  • Key risk checked: a 30s overall http.Client.Timeout would also abort a legitimately long control-plane call. Verified coreapi has no streaming / SSE / long-poll operations (all unary JSON), and the data plane (git clone/fetch) uses a separate client that is untouched. 30s is safe.
  • Security: net-positive — bounds an indefinite hang / resource exhaustion. No new input parsing, endpoints, auth, or secret handling; the cross-juris JWT trust anchors are untouched.
  • Simplify: the newCrossJurisHTTPClientWithTimeout helper is justified by testability (150ms budget in tests) and mirrors the existing federationLookupTimeout const idiom. No duplication.
  • Docs: no user-facing behavior change; no stale docs. The overall timeout is a fixed 30s const (not env-tunable) — possible follow-up if operators want ENTIRE_* tuning like the dial timeout already has.

mise run lint clean, mise run fmt clean, targeted tests pass (silent host times out in 0.15s; responsive host still succeeds).

🚧 Draft — NO MERGE / NO DEPLOY.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants