Add Claude Status Health Guard — pause pipeline on Anthropic API degradation #667

@Trecek

Problem

When Anthropic's status page reports degradation (503s, overload, outages), headless sessions hit API errors and fail. There is no mechanism to detect this proactively and pause before wasting tokens on doomed sessions.

Solution

A status health guard that mirrors the quota guard pattern — polling the Anthropic status page, caching the result, and blocking run_skill when the API is degraded.

Enforcement Strategy

Per #663, the hook-deny-with-sleep-instruction approach is unreliable — the LLM sometimes treats denials as terminal instead of sleeping and retrying. Whatever enforcement mechanism #663 adopts for the quota guard (likely server-side asyncio.sleep in run_skill before launching the headless session), the status health guard should use the same approach from the start.

The implementation should include:

  • Server-side gate in tools_execution.py — check status cache before launching the headless session; if degraded, asyncio.sleep(pause_seconds) then re-check in a loop until healthy
  • PreToolUse hook as defense-in-depth (same fail-open pattern as quota guard) — catches cases where the server-side gate is bypassed
  • Background refresh loop — keeps the cache fresh so recovery is detected during a sleep cycle
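
The server-side gate above can be sketched roughly as follows. This is a minimal illustration, not the pattern #663 will settle on: wait_until_healthy and is_degraded are hypothetical names, and the injectable sleep parameter is an assumption added only so the loop can be exercised without real delays.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_until_healthy(
    is_degraded: Callable[[], Awaitable[bool]],
    pause_seconds: float,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> int:
    """Block while is_degraded() reports True; return how many pauses were taken.

    Hypothetical helper: is_degraded stands in for a status-cache check.
    """
    pauses = 0
    # Re-check after every pause so recovery ends the wait at the next cycle.
    while await is_degraded():
        await sleep(pause_seconds)
        pauses += 1
    return pauses
```

In run_skill, a call such as `await wait_until_healthy(check, cfg.pause_seconds)` would sit just before the headless session is launched, where `check` and `cfg` are whatever the real implementation provides.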

Polling & Caching

  1. Background polling: polls https://status.anthropic.com/api/v2/status.json via a refresh loop (like quota_refresh_loop), caching to ~/.claude/autoskillit_status_cache.json
  2. Cache-first check: check_status_health() reads the cache when fresh and live-fetches on a miss or stale entry
  3. Degradation signal: status.indicator != "none" (values: minor, major, critical)
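
A rough sketch of the cache-first check, under stated assumptions: the cache file is taken to hold a JSON object with indicator and fetched_at keys (this schema is not specified in the issue), the signature of check_status_health is a guess, and the live fetch is injected as a callable so the example stays self-contained.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


def read_fresh_cache(path: Path, max_age: float, now: float) -> Optional[dict]:
    """Return the cached entry if it exists, parses, and is at most max_age old."""
    try:
        entry = json.loads(path.read_text())
        if now - float(entry["fetched_at"]) <= max_age:
            return entry
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing, corrupt, or malformed cache counts as a miss
    return None


def check_status_health(
    path: Path,
    max_age: float,
    fetch: Callable[[], str],
    now: Optional[float] = None,
) -> dict:
    """Serve from cache when fresh; otherwise live-fetch and rewrite the cache."""
    now = time.time() if now is None else now
    cached = read_fresh_cache(path, max_age, now)
    if cached is not None:
        return cached
    # Assumed schema: the indicator string plus the time it was fetched.
    entry = {"indicator": fetch(), "fetched_at": now}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(entry))
    return entry
```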

API Details

  • Endpoint: https://status.anthropic.com/api/v2/status.json (no auth, standard Statuspage.io)
  • Response: {"status": {"indicator": "none|minor|major|critical", "description": "..."}}
  • Components endpoint: /api/v2/components.json — per-component status for "Claude Code" and "Claude API (api.anthropic.com)"
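
For illustration, reading the status endpoint and extracting the degradation signal could look like the sketch below. It uses only the standard library; fetch_status_body, parse_indicator, and is_degraded are hypothetical names, and the 10-second timeout is an arbitrary assumption.

```python
import json
import urllib.request

STATUS_URL = "https://status.anthropic.com/api/v2/status.json"


def fetch_status_body(url: str = STATUS_URL, timeout: float = 10.0) -> str:
    """Fetch the raw JSON body from the status endpoint (no auth required)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_indicator(body: str) -> str:
    """Extract status.indicator from a response shaped as described above."""
    return json.loads(body)["status"]["indicator"]


def is_degraded(indicator: str) -> bool:
    """Any indicator other than "none" (minor, major, critical) is degraded."""
    return indicator != "none"
```

Keeping parsing separate from the network call lets the degradation rule be tested against canned response bodies.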

Components

| Layer | File | Purpose |
| --- | --- | --- |
| L1 execution | execution/status.py | StatusHealth dataclass, _fetch_status(), cache I/O, check_status_health(), _refresh_status_cache() |
| Hook | hooks/status_health_guard.py | stdlib-only PreToolUse hook (defense-in-depth), fail-open, logs to status_events.jsonl |
| Config | config/settings.py | StatusGuardConfig (enabled, cache_max_age, cache_refresh_interval, cache_path, pause_seconds, status_url) |
| Config | config/defaults.yaml | status_guard section |
| Registry | hook_registry.py | Add to run_skill matcher scripts list |
| Server | server/helpers.py | _prime_status_cache(), _status_refresh_loop() |
| Server | server/tools_kitchen.py | Wire into open/close kitchen, _write_hook_config |
| Server | server/tools_execution.py | Server-side sleep gate in run_skill (match #663 pattern) |
| Server | server/tools_status.py | get_status_events diagnostic tool |
| Pipeline | pipeline/context.py | status_refresh_task field (reuse QuotaRefreshTask protocol) |
| Core | core/_type_constants.py | Register get_status_events in GATED_TOOLS, TOOL_SUBSET_TAGS, TOOL_CATEGORIES |

Config Defaults

status_guard:
  enabled: true
  cache_max_age: 300
  cache_refresh_interval: 240
  cache_path: "~/.claude/autoskillit_status_cache.json"
  pause_seconds: 1800   # 30 minutes
  status_url: "https://status.anthropic.com/api/v2/status.json"
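
These defaults could map onto a dataclass along the following lines. The field names and values come from this issue; the frozen flag and the resolved_cache_path helper are assumptions for illustration, not the project's actual definition.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class StatusGuardConfig:
    """Sketch of the status guard settings, mirroring the YAML defaults."""

    enabled: bool = True
    cache_max_age: int = 300           # seconds before a cache entry is stale
    cache_refresh_interval: int = 240  # seconds between background polls
    cache_path: str = "~/.claude/autoskillit_status_cache.json"
    pause_seconds: int = 1800          # 30 minutes
    status_url: str = "https://status.anthropic.com/api/v2/status.json"

    def resolved_cache_path(self) -> Path:
        """Hypothetical helper: expand "~" to the user's home directory."""
        return Path(self.cache_path).expanduser()
```

Note that the refresh interval (240 s) is shorter than the maximum cache age (300 s), so a healthy refresh loop should replace each entry before it goes stale.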

Requirements

POLL — Status Page Polling & Caching

  • REQ-POLL-001: The system must poll https://status.anthropic.com/api/v2/status.json on a configurable interval via a background refresh loop.
  • REQ-POLL-002: The system must cache the poll result to a configurable file path with a timestamp for freshness validation.
  • REQ-POLL-003: The system must prime the status cache at kitchen open before any run_skill hook fires.
  • REQ-POLL-004: The system must cancel the background refresh loop when the kitchen closes.
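
One way the refresh loop and its cancellation at kitchen close might look, as a sketch only: refresh_loop_sketch and stop_refresh are hypothetical names, and the refresh callable stands in for whatever updates the cache.

```python
import asyncio
import contextlib
from typing import Awaitable, Callable


async def refresh_loop_sketch(
    refresh: Callable[[], Awaitable[None]], interval: float
) -> None:
    """Call refresh() forever, sleeping interval seconds between calls."""
    while True:
        try:
            await refresh()
        except Exception:
            # A failed poll should not kill the loop; the next cycle retries.
            # CancelledError is not an Exception subclass, so cancellation
            # still propagates.
            pass
        await asyncio.sleep(interval)


async def stop_refresh(task: "asyncio.Task[None]") -> None:
    """Cancel the loop task and wait for it to finish unwinding."""
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
```

At kitchen open the loop would be started with asyncio.create_task after the cache has been primed, and stop_refresh would run at kitchen close.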

GATE — Enforcement Gate

  • REQ-GATE-001: The run_skill handler must check the status cache before launching a headless session and block with asyncio.sleep when the API is degraded.
  • REQ-GATE-002: The server-side sleep must re-check status after each pause cycle and resume only when status.indicator == "none".
  • REQ-GATE-003: A PreToolUse hook must serve as defense-in-depth, denying run_skill with a sleep instruction when the cache indicates degradation.
  • REQ-GATE-004: The hook must be stdlib-only with no autoskillit imports.
  • REQ-GATE-005: The hook must fail open on missing, stale, or corrupt cache.
  • REQ-GATE-006: The enforcement mechanism must mirror whatever pattern #663 ("Quota guard hook inconsistently blocks run_skill instead of triggering automatic wait-and-retry") establishes for the quota guard.
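
The fail-open rule for the hook can be sketched as a pure, stdlib-only decision function. Everything here is illustrative: decide is a hypothetical name, the cache schema (indicator, fetched_at) is assumed, and the hook's actual stdin/stdout protocol is deliberately left out.

```python
import json
import time
from pathlib import Path
from typing import Optional, Tuple


def decide(
    cache_path: Path, max_age: float, now: Optional[float] = None
) -> Tuple[str, str]:
    """Return (decision, event): deny only on a fresh cache showing degradation."""
    now = time.time() if now is None else now
    try:
        entry = json.loads(cache_path.read_text())
        indicator = entry["indicator"]
        age = now - float(entry["fetched_at"])
    except (OSError, ValueError, KeyError, TypeError):
        return "allow", "cache_miss"  # missing or corrupt cache: fail open
    if age > max_age:
        return "allow", "cache_miss"  # stale cache: fail open
    if indicator == "none":
        return "allow", "approved"
    return "deny", "blocked"
```

The event strings match the ones named in REQ-OBS-001, so the same return value could drive both the hook response and the log entry.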

CFG — Configuration

  • REQ-CFG-001: The system must provide a StatusGuardConfig dataclass with fields: enabled, cache_max_age, cache_refresh_interval, cache_path, pause_seconds, status_url.
  • REQ-CFG-002: The default pause duration must be 1800 seconds (30 minutes).
  • REQ-CFG-003: The configuration must be exposed in defaults.yaml and wirable via the standard layered config resolution.

OBS — Observability

  • REQ-OBS-001: The hook must log events (approved, blocked, cache_miss) to status_events.jsonl in the autoskillit log directory.
  • REQ-OBS-002: A get_status_events MCP tool must expose the diagnostic log for pipeline debugging.
  • REQ-OBS-003: get_status_events must be registered in GATED_TOOLS, TOOL_SUBSET_TAGS, and TOOL_CATEGORIES.
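
As an illustration of the event log, the sketch below appends one JSON object per line and reads the most recent entries back. The record fields (ts, event, indicator) and the function names are assumptions; the issue specifies only the file name and the event types.

```python
import json
import time
from pathlib import Path
from typing import List, Optional


def append_event(
    log_path: Path,
    event: str,
    indicator: Optional[str] = None,
    now: Optional[float] = None,
) -> None:
    """Append one event record (approved, blocked, cache_miss) as a JSON line."""
    record = {
        "ts": time.time() if now is None else now,
        "event": event,
        "indicator": indicator,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def read_events(log_path: Path, limit: int = 50) -> List[dict]:
    """Return up to limit most recent events, skipping unparseable lines.

    limit is expected to be a positive integer.
    """
    try:
        lines = log_path.read_text(encoding="utf-8").splitlines()
    except OSError:
        return []  # no log yet
    events = []
    for line in lines[-limit:]:
        try:
            events.append(json.loads(line))
        except ValueError:
            continue
    return events
```

A get_status_events tool could return the output of read_events more or less directly.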

WIRE — Integration Wiring

  • REQ-WIRE-001: The hook must be registered in hook_registry.py in the run_skill matcher scripts list.
  • REQ-WIRE-002: tools_kitchen.py must start and cancel the status refresh task alongside the quota refresh task.
  • REQ-WIRE-003: _write_hook_config must include the status_guard section for the hook subprocess.
  • REQ-WIRE-004: ToolContext must include a status_refresh_task field reusing the QuotaRefreshTask protocol.
