fix: avoid Worker CPU limit (1043) in /api/status gateway start #342

Merged
andreasjansson merged 7 commits into main from fix/ci-stability
Mar 29, 2026

@andreasjansson
Member

Summary

Fixes the intermittent error 1043 (Worker exceeded CPU time limit) that causes the blank page timeout in CI.

Root cause

/api/status called ensureGateway() inline, wrapped in a 25s Promise.race timeout. However, ensureGateway internally calls waitForPort with a 180s timeout, and that RPC can't be cancelled. Promise.race only stops waiting for the result; it does not stop the work. So even after the 25s timeout fires, the waitForPort RPC keeps running in the background, exhausting the Worker's 30s CPU limit and causing error 1043. After a 1043, the Worker is completely unresponsive and the browser gets a blank page.
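
To illustrate why the timeout did not help, here is a minimal, self-contained sketch of Promise.race losing interest in a slow promise without stopping it. `slowRpc` is a hypothetical stand-in for an uncancellable RPC such as waitForPort, and the durations are shortened to milliseconds; none of this is the PR's actual code.

```typescript
// Hypothetical stand-in for an uncancellable RPC such as waitForPort.
function slowRpc(ms: number, log: string[]): Promise<string> {
  return new Promise((resolve) => {
    setTimeout(() => {
      log.push("rpc finished");
      resolve("ready");
    }, ms);
  });
}

// Resolves with a marker value after `ms` milliseconds.
function timeout(ms: number): Promise<"timeout"> {
  return new Promise((resolve) => setTimeout(() => resolve("timeout"), ms));
}

async function raceDemo(): Promise<{ winner: string; log: string[] }> {
  const log: string[] = [];
  const rpc = slowRpc(50, log);

  // The race settles as soon as the 10ms timer fires...
  const winner = await Promise.race([rpc, timeout(10)]);
  log.push(`race settled: ${winner}`);

  // ...but the losing promise was never cancelled and still runs to completion.
  await rpc;
  return { winner, log };
}
```

In this sketch the log records the race settling first and the RPC finishing afterwards, which mirrors the situation described above: the handler has moved on, but the work is still attached to the Worker.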

Fix

Add a waitForReady option to ensureGateway(). When false, it starts the gateway process (fast RPC, ~2-5s) but skips the waitForPort step. /api/status uses waitForReady: false so it returns quickly. The loading page polls every 2s — subsequent polls find the running process and check port readiness via the existing health check.

Other callers (crash retry, non-HTML catch-all) continue using waitForReady: true (the default) since they need to proxy immediately after the gateway starts.
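
The shape of the new option could look roughly like the sketch below. Only `ensureGateway`, `waitForReady`, `waitForPort`, and the 180s timeout come from this PR; the `GatewaySandbox` interface, `startGatewayProcess`, and `makeFakeSandbox` are hypothetical names used to keep the example self-contained, and the real function likely does more (for example, reusing an existing process).

```typescript
// Hypothetical minimal surface; only waitForPort is named in the PR.
interface GatewaySandbox {
  startGatewayProcess(): Promise<void>;
  waitForPort(timeoutMs: number): Promise<void>;
}

interface EnsureGatewayOptions {
  // Defaults to true so existing callers keep blocking until the port is up.
  waitForReady?: boolean;
}

async function ensureGateway(
  sandbox: GatewaySandbox,
  options: EnsureGatewayOptions = {},
): Promise<void> {
  const { waitForReady = true } = options;

  // Fast step: kick off the gateway process.
  await sandbox.startGatewayProcess();

  // Polling callers such as /api/status return here and never start
  // the long, uncancellable waitForPort RPC.
  if (!waitForReady) {
    return;
  }

  // Slow step: block until the port accepts connections (up to 180s).
  await sandbox.waitForPort(180000);
}

// Fake sandbox that records which steps were invoked.
function makeFakeSandbox(calls: string[]): GatewaySandbox {
  return {
    async startGatewayProcess() {
      calls.push("start");
    },
    async waitForPort(timeoutMs: number) {
      calls.push(`waitForPort:${timeoutMs}`);
    },
  };
}
```

Under these assumptions, `/api/status` would call `ensureGateway(sandbox, { waitForReady: false })`, while the crash-retry and catch-all paths would keep calling it with no options.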

ensureGateway's waitForPort blocks for up to 180s. Even with a 25s
Promise.race timeout, the underlying RPC continues running and
exhausts the Worker's 30s CPU limit (error 1043).

Fix: add waitForReady option to ensureGateway. When false, it starts
the process but returns immediately without waitForPort. The /api/status
handler uses this — the loading page polls every 2s and subsequent polls
check if the port is up via the existing process check.

containerFetch blocks until the container responds, which can take
30-60s+ on cold start. The browser gets a blank page because the Worker
times out before containerFetch returns. Add a 15s timeout for HTML
requests; on timeout, the catch block serves the loading page.

If containerFetch returns headers but the body stream hangs (gateway
partially initialized), httpResponse.text() blocks forever. The browser
gets a blank page. Add a 10s timeout — on timeout, serve the loading
page instead.
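
The two timeouts described in these commits (15s waiting for the response, 10s reading the body) can be sketched as follows. This is an illustration only: `withTimeout`, `UpstreamResponse`, `htmlOrLoadingPage`, and the `LOADING_PAGE` placeholder are hypothetical names, and the fetch is passed in as a function so the example does not depend on the real containerFetch signature.

```typescript
// Rejects if `work` has not settled within `ms`. This only stops waiting;
// it does not cancel the underlying operation.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Hypothetical shape: just enough of a response to show the two phases.
interface UpstreamResponse {
  text(): Promise<string>;
}

// Placeholder markup; the real loading page is not shown in this PR.
const LOADING_PAGE = "<html><body>Loading...</body></html>";

async function htmlOrLoadingPage(
  fetchFromContainer: () => Promise<UpstreamResponse>,
  headerTimeoutMs = 15000,
  bodyTimeoutMs = 10000,
): Promise<string> {
  try {
    // Phase 1: wait for the container to respond at all.
    const response = await withTimeout(fetchFromContainer(), headerTimeoutMs, "containerFetch");
    // Phase 2: headers arrived, but the body stream may still hang.
    return await withTimeout(response.text(), bodyTimeoutMs, "body read");
  } catch {
    // Any timeout or fetch error falls back to the loading page.
    return LOADING_PAGE;
  }
}
```

Because the helper cannot cancel the pending call, a timeout like this shortens the wait for the browser but leaves the background work running.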

The 'base' variant in CI was showing the 'Configuration Required' error
page instead of the loading page because E2E_TEST_MODE didn't skip env
validation. The AI gateway keys may not be set for all variants.

The validateRequiredEnv function already checks isTestMode for CF Access
vars — extend the middleware to skip validation entirely in E2E mode,
matching the existing behavior in dev mode.
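
A minimal sketch of the middleware change, under stated assumptions: `E2E_TEST_MODE` and `validateRequiredEnv` are named in this PR, but the `DEV_MODE` variable name, the `"true"` flag value, the `EXAMPLE_GATEWAY_KEY` variable, and the return shapes are all hypothetical, and this `validateRequiredEnv` is a simplified stand-in for the real function.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical required variable; the real list is not shown in this PR.
const REQUIRED_VARS = ["EXAMPLE_GATEWAY_KEY"];

// Simplified stand-in: returns the names of required variables that are unset.
function validateRequiredEnv(env: Env): string[] {
  return REQUIRED_VARS.filter((name) => !env[name]);
}

// Assumption: both modes are enabled by the string "true".
function shouldSkipValidation(env: Env): boolean {
  return env["DEV_MODE"] === "true" || env["E2E_TEST_MODE"] === "true";
}

type MiddlewareResult = "configuration-required" | "continue";

function envValidationMiddleware(env: Env): MiddlewareResult {
  // E2E mode now short-circuits before any validation, as dev mode already did.
  if (shouldSkipValidation(env)) {
    return "continue";
  }
  return validateRequiredEnv(env).length > 0 ? "configuration-required" : "continue";
}
```

With this shape, a CI variant that lacks the gateway keys reaches the loading page instead of the 'Configuration Required' page, as long as `E2E_TEST_MODE` is set.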

The blank page was caused by containerFetch hanging when the gateway
wasn't ready. Even with timeouts, the uncancelled background RPC
would exhaust the Worker CPU limit.

Fix: for HTML requests, check if the gateway process exists first
(3s timeout on findExistingGatewayProcess). If not running, serve
the loading page immediately without calling containerFetch. The
loading page handles polling, probing, and reloading.

This completely avoids calling containerFetch when the gateway isn't
ready, eliminating both the blank page and CPU limit issues.
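
The final approach can be sketched like this: probe for the process with a short timeout, and only proxy when the probe succeeds. `findExistingGatewayProcess`, `containerFetch`, and the 3s timeout come from the commit message; the `HtmlRouteDeps` interface, `settleWithin`, `handleHtmlRequest`, and the choice to treat a failed probe as "not running" are assumptions made for illustration.

```typescript
// Resolves to `fallback` if `work` has not settled within `ms`.
function settleWithin<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Hypothetical dependency bundle so the sketch stays self-contained.
interface HtmlRouteDeps {
  findExistingGatewayProcess(): Promise<object | null>;
  containerFetch(): Promise<string>;
  loadingPage(): string;
}

async function handleHtmlRequest(
  deps: HtmlRouteDeps,
  probeTimeoutMs = 3000,
): Promise<string> {
  // Assumption: a probe that fails or times out counts as "not running".
  const gatewayProcess = await settleWithin(
    deps.findExistingGatewayProcess().catch(() => null),
    probeTimeoutMs,
    null,
  );

  if (gatewayProcess === null) {
    // containerFetch is never started, so nothing is left hanging in the background.
    return deps.loadingPage();
  }
  return deps.containerFetch();
}
```

The design point is ordering: the cheap check runs first, so the expensive, uncancellable call is only made when it is expected to return promptly.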
@github-actions

E2E Test Recording (base)

✅ Tests passed

E2E Test Video

@github-actions

E2E Test Recording (discord)

✅ Tests passed

E2E Test Video

@github-actions

E2E Test Recording (workers-ai)

✅ Tests passed

E2E Test Video

@github-actions

E2E Test Recording (telegram)

✅ Tests passed

E2E Test Video

@andreasjansson andreasjansson merged commit 7b00c1d into main Mar 29, 2026
8 checks passed