Problem
When Anthropic's status page reports degradation (503s, overload, outages), headless sessions hit API errors and fail. There is no mechanism to detect this proactively and pause before wasting tokens on doomed sessions.
Solution
A status health guard that mirrors the quota guard pattern — polling the Anthropic status page, caching the result, and blocking run_skill when the API is degraded.
Enforcement Strategy
Per #663, the hook-deny-with-sleep-instruction approach is unreliable — the LLM sometimes treats denials as terminal instead of sleeping and retrying. Whatever enforcement mechanism #663 adopts for the quota guard (likely server-side asyncio.sleep in run_skill before launching the headless session), the status health guard should use the same approach from the start.
The implementation should include:
Server-side gate in tools_execution.py — check status cache before launching the headless session; if degraded, asyncio.sleep(pause_seconds) then re-check in a loop until healthy
PreToolUse hook as defense-in-depth (same fail-open pattern as quota guard) — catches cases where the server-side gate is bypassed
Background refresh loop — keeps the cache fresh so recovery is detected during a sleep cycle
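The server-side gate in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `wait_until_healthy` and `read_indicator` are hypothetical names, and only `asyncio.sleep`, `pause_seconds`, and the `"none"` indicator come from the text.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_until_healthy(
    read_indicator: Callable[[], Awaitable[str]],  # hypothetical cache reader
    pause_seconds: float,
) -> int:
    """Sleep in pause_seconds increments until the indicator is "none".

    Returns the number of pause cycles that were needed.
    """
    cycles = 0
    while await read_indicator() != "none":
        # Block server-side instead of asking the LLM to sleep and retry.
        await asyncio.sleep(pause_seconds)
        cycles += 1
    return cycles
```

Because the background refresh loop keeps the cache current, each re-check after a pause would see recovery without the gate itself needing to fetch.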
Polling & Caching
Background polling — polls https://status.anthropic.com/api/v2/status.json via a refresh loop (like quota_refresh_loop), caching to ~/.claude/autoskillit_status_cache.json
Cache-first check — check_status_health() reads cache when fresh, live-fetches on miss/stale
Degradation signal — status.indicator != "none" (values: minor, major, critical)
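A cache-first check along these lines might look like the sketch below. The cache layout (`fetched_at`, `payload`) and the helper names `read_fresh_cache` and `check_status` are assumptions made for illustration; the issue only specifies the degradation signal and the fresh-cache-else-live-fetch behavior.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


def is_degraded(payload: dict) -> bool:
    """Degradation signal from the text: status.indicator != "none"."""
    return payload["status"]["indicator"] != "none"


def read_fresh_cache(path: Path, max_age: float, now: Optional[float] = None) -> Optional[dict]:
    """Return the cached payload if present, parseable, and fresh; else None."""
    now = time.time() if now is None else now
    try:
        entry = json.loads(path.read_text())
        if now - entry["fetched_at"] <= max_age:
            return entry["payload"]
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or corrupt cache is treated as a miss
    return None


def check_status(path: Path, max_age: float, fetch: Callable[[], dict]) -> bool:
    """Cache-first: use the cache when fresh, otherwise fetch live and re-cache."""
    payload = read_fresh_cache(path, max_age)
    if payload is None:
        payload = fetch()  # hypothetical live fetch of status.json
        path.write_text(json.dumps({"fetched_at": time.time(), "payload": payload}))
    return is_degraded(payload)
```

Passing `fetch` in as a callable keeps the sketch independent of any particular HTTP client.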
API Details
Endpoint: https://status.anthropic.com/api/v2/status.json (no auth, standard Statuspage.io)
Requirements
POLL — Status Page Polling & Caching
REQ-POLL-001: The system must poll https://status.anthropic.com/api/v2/status.json on a configurable interval via a background refresh loop.
REQ-POLL-002: The system must cache the poll result to a configurable file path with a timestamp for freshness validation.
REQ-POLL-003: The system must prime the status cache at kitchen open before any run_skill hook fires.
REQ-POLL-004: The system must cancel the background refresh loop when the kitchen closes.
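The lifecycle implied by the POLL requirements (prime at kitchen open, refresh in the background, cancel at kitchen close) could be sketched as below. `status_refresh_loop`, `kitchen_session`, `refresh`, and `work` are hypothetical names, and swallowing poll errors inside the loop is an assumption rather than something the issue states.

```python
import asyncio
from typing import Awaitable, Callable


async def status_refresh_loop(refresh: Callable[[], Awaitable[None]], interval: float) -> None:
    """Call refresh() every `interval` seconds until the task is cancelled."""
    while True:
        try:
            await refresh()
        except Exception:
            # Assumption: one failed poll should not kill the loop.
            pass
        await asyncio.sleep(interval)


async def kitchen_session(
    refresh: Callable[[], Awaitable[None]],
    interval: float,
    work: Callable[[], Awaitable[None]],
) -> None:
    """Prime the cache, run the loop in the background, cancel it on close."""
    await refresh()  # prime before any run_skill hook can fire (REQ-POLL-003)
    task = asyncio.create_task(status_refresh_loop(refresh, interval))
    try:
        await work()
    finally:
        task.cancel()  # kitchen closed (REQ-POLL-004)
        # Wait for the cancellation to finish without re-raising it here.
        await asyncio.gather(task, return_exceptions=True)
```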
GATE — Enforcement Gate
REQ-GATE-001: The run_skill handler must check the status cache before launching a headless session and block with asyncio.sleep when the API is degraded.
REQ-GATE-002: The server-side sleep must re-check status after each pause cycle and resume only when status.indicator == "none".
REQ-GATE-003: A PreToolUse hook must serve as defense-in-depth, denying run_skill with a sleep instruction when the cache indicates degradation.
REQ-GATE-004: The hook must be stdlib-only with no autoskillit imports.
REQ-GATE-005: The hook must fail open on missing, stale, or corrupt cache.
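The fail-open rule in REQ-GATE-005 can be illustrated with a stdlib-only decision function. This sketch covers only the decision, not the hook's input/output protocol; the function name `hook_decision`, the cache layout (`fetched_at`, `payload`), and the wording of the deny message are all assumptions.

```python
import json
import time
from pathlib import Path
from typing import Optional


def hook_decision(cache_path: Path, max_age: float, pause_seconds: int) -> Optional[str]:
    """Return a deny message when the cache shows degradation, else None (allow).

    Fails open: a missing, stale, or corrupt cache always allows the call.
    """
    try:
        entry = json.loads(cache_path.read_text())
        if time.time() - entry["fetched_at"] > max_age:
            return None  # stale cache: fail open
        indicator = entry["payload"]["status"]["indicator"]
    except (OSError, ValueError, KeyError, TypeError):
        return None  # missing or corrupt cache: fail open
    if indicator == "none":
        return None
    return f"Anthropic status is '{indicator}'; sleep {pause_seconds}s, then retry run_skill."
```

Using only `json`, `time`, and `pathlib` keeps the function compatible with REQ-GATE-004, which rules out autoskillit imports in the hook.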
CFG — Configuration
REQ-CFG-001: The system must provide a StatusGuardConfig dataclass with fields: enabled, cache_max_age, cache_refresh_interval, cache_path, pause_seconds, status_url.
REQ-CFG-002: The default pause duration must be 1800 seconds (30 minutes).
REQ-CFG-003: The configuration must be exposed in defaults.yaml and wirable via the standard layered config resolution.
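A dataclass with the six fields named in REQ-CFG-001 might be shaped as follows. Only `pause_seconds` (1800), the status URL, and the cache path are taken from the issue; the defaults for `enabled`, `cache_max_age`, and `cache_refresh_interval` are placeholders chosen for this sketch, and the field types are assumed.

```python
from dataclasses import dataclass


@dataclass
class StatusGuardConfig:
    enabled: bool = True                # placeholder default (assumption)
    cache_max_age: int = 300            # seconds; placeholder (assumption)
    cache_refresh_interval: int = 240   # seconds; placeholder (assumption)
    cache_path: str = "~/.claude/autoskillit_status_cache.json"
    pause_seconds: int = 1800           # 30 minutes, per REQ-CFG-002
    status_url: str = "https://status.anthropic.com/api/v2/status.json"
```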
OBS — Observability
REQ-OBS-001: The hook must log events (approved, blocked, cache_miss) to status_events.jsonl in the autoskillit log directory.
REQ-OBS-002: A get_status_events MCP tool must expose the diagnostic log for pipeline debugging.
REQ-OBS-003: get_status_events must be registered in GATED_TOOLS, TOOL_SUBSET_TAGS, and TOOL_CATEGORIES.
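The event log described in the OBS requirements could be written and read back as in this sketch. The record fields (`ts`, `event`) and the helper names `log_status_event` and `read_status_events` are assumptions; the issue only fixes the file name status_events.jsonl and the event kinds (approved, blocked, cache_miss).

```python
import json
import time
from pathlib import Path


def log_status_event(log_dir: Path, event: str, **details) -> None:
    """Append one JSON record per line to status_events.jsonl."""
    log_dir.mkdir(parents=True, exist_ok=True)
    record = {"ts": time.time(), "event": event, **details}
    with (log_dir / "status_events.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def read_status_events(log_dir: Path, limit: int = 50) -> list:
    """Return up to `limit` most recent events, oldest first."""
    path = log_dir / "status_events.jsonl"
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-limit:] if line.strip()]
```

A diagnostic tool such as get_status_events could return the result of a reader like `read_status_events` directly.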
WIRE — Integration Wiring
REQ-WIRE-001: The hook must be registered in hook_registry.py in the run_skill matcher scripts list.
REQ-WIRE-002: tools_kitchen.py must start and cancel the status refresh task alongside the quota refresh task.
REQ-WIRE-003: _write_hook_config must include the status_guard section for the hook subprocess.
REQ-WIRE-004: ToolContext must include a status_refresh_task field reusing the QuotaRefreshTask protocol.
API Details
Response: {"status": {"indicator": "none|minor|major|critical", "description": "..."}}
Per-component status: /api/v2/components.json, covering "Claude Code" and "Claude API (api.anthropic.com)"
Components
execution/status.py: StatusHealth dataclass, _fetch_status(), cache I/O, check_status_health(), _refresh_status_cache()
hooks/status_health_guard.py: PreToolUse hook; logs events to status_events.jsonl
config/settings.py: StatusGuardConfig (enabled, cache_max_age, cache_refresh_interval, cache_path, pause_seconds, status_url)
config/defaults.yaml: status_guard section
hook_registry.py: hook registered in the run_skill matcher scripts list
server/helpers.py: _prime_status_cache(), _status_refresh_loop()
server/tools_kitchen.py: starts and cancels the status refresh task; _write_hook_config includes the status_guard section
server/tools_execution.py: server-side gate in run_skill (match #663 pattern)
server/tools_status.py: get_status_events diagnostic tool
pipeline/context.py: status_refresh_task field (reuse QuotaRefreshTask protocol)
core/_type_constants.py: get_status_events in GATED_TOOLS, TOOL_SUBSET_TAGS, TOOL_CATEGORIES
Design Invariants
TypeError is caught in _prime_status_cache and check_status_health (httpx raises on a non-string URL in test mocks)