Add MAINTENANCE state for safer cluster maintenance operations #799

@anawarkar

Trino Gateway currently lacks a way to distinguish between unexpected cluster failures and planned maintenance windows. This creates challenges for production operations:

  1. Orchestration Integration: No standard way for automation systems to signal "this cluster is intentionally offline"
  2. Routing Coordination: Queries may be routed to clusters undergoing planned work
  3. Observability: Cannot distinguish planned downtime from failures in metrics and alerts
  4. Multi-Team Operations: No clear signal about maintenance state across teams

I propose adding a MAINTENANCE state to the TrinoStatus enum, with two levels of protection:

Level 1: Routing Protection (Hard Block)

  • Clusters in MAINTENANCE are automatically excluded from query routing
  • Prevents queries from being routed to clusters undergoing maintenance
  • Works automatically through the existing routing filters, which route only to HEALTHY clusters
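
A minimal sketch of how Level 1 could fall out of a HEALTHY-only filter. Only TrinoStatus, HEALTHY, and MAINTENANCE come from this proposal. The UNHEALTHY value, the Cluster record, and eligibleForRouting are hypothetical stand-ins, not the gateway's actual classes, and the real enum may contain other values.

```java
import java.util.List;

public class MaintenanceRoutingSketch {
    // Stand-in for the gateway's status enum with the proposed value added.
    // UNHEALTHY is an assumed existing value, included only for illustration.
    enum TrinoStatus { HEALTHY, UNHEALTHY, MAINTENANCE }

    // Hypothetical, simplified view of a backend cluster.
    record Cluster(String name, TrinoStatus status) {}

    // Keeps only HEALTHY clusters, so MAINTENANCE is excluded
    // without any maintenance-specific routing logic.
    static List<Cluster> eligibleForRouting(List<Cluster> clusters) {
        return clusters.stream()
                .filter(c -> c.status() == TrinoStatus.HEALTHY)
                .toList();
    }

    public static void main(String[] args) {
        List<Cluster> clusters = List.of(
                new Cluster("trino-a", TrinoStatus.HEALTHY),
                new Cluster("trino-b", TrinoStatus.MAINTENANCE),
                new Cluster("trino-c", TrinoStatus.UNHEALTHY));
        // Only trino-a passes the filter.
        eligibleForRouting(clusters).forEach(c -> System.out.println(c.name()));
    }
}
```

Because the filter is an allow-list on HEALTHY, any new non-HEALTHY state is blocked by default, which is why no routing change should be needed if the existing filters behave this way.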

Level 2: Operator Protection (Soft Warning)

  • UI displays clear visual indicator (orange badge) for clusters in maintenance
  • Operators attempting to activate a cluster in maintenance are shown a confirmation dialog
  • Allows emergency overrides while preventing accidental interference
  • Operator override wins: Activating/deactivating a cluster via UI or API takes precedence over maintenance state
  • All override actions are logged for audit purposes
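
A rough sketch of the Level 2 flow, to make the intended behaviour concrete. Every name here (Cluster, Outcome, activate, the confirmed flag, the in-memory audit list) is hypothetical. The proposal does not say whether an override should also clear the MAINTENANCE status, so this sketch only flips an assumed active flag and leaves the status untouched.

```java
import java.util.ArrayList;
import java.util.List;

public class MaintenanceOverrideSketch {
    // UNHEALTHY is an assumed existing value, included only for illustration.
    enum TrinoStatus { HEALTHY, UNHEALTHY, MAINTENANCE }

    enum Outcome { ACTIVATED, CONFIRMATION_REQUIRED }

    // Hypothetical, simplified mutable view of a backend cluster.
    static final class Cluster {
        final String name;
        final TrinoStatus status;
        boolean active;

        Cluster(String name, TrinoStatus status) {
            this.name = name;
            this.status = status;
        }
    }

    // In-memory stand-in for an audit trail.
    private final List<String> auditLog = new ArrayList<>();

    // Soft warning: an unconfirmed request against a MAINTENANCE cluster
    // is bounced back so the caller can ask the operator to confirm.
    Outcome activate(Cluster cluster, String operator, boolean confirmed) {
        boolean inMaintenance = cluster.status == TrinoStatus.MAINTENANCE;
        if (inMaintenance && !confirmed) {
            return Outcome.CONFIRMATION_REQUIRED;
        }
        if (inMaintenance) {
            // Operator override wins, but it is recorded for audit.
            auditLog.add(operator + " overrode MAINTENANCE on " + cluster.name);
        }
        cluster.active = true;
        return Outcome.ACTIVATED;
    }

    List<String> auditLog() {
        return List.copyOf(auditLog);
    }
}
```

A UI would call activate with confirmed set to false first, show the dialog on CONFIRMATION_REQUIRED, and retry with confirmed set to true if the operator accepts.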

Use Cases

  1. Blue/Green Deployments: Mark the cluster running the old version as MAINTENANCE during a version rollout
  2. Infrastructure Maintenance: OS patches, hardware upgrades, network configuration changes
  3. Capacity Management: Planned cluster resizing or configuration updates
  4. Automated Orchestration: Integration with cluster management systems (Kubernetes operators, ACM, etc.)
  5. Multi-Team Coordination: Clear signal to all teams that a cluster is intentionally offline

I'm open to suggestions and happy to discuss any concerns. Input on the best path forward is welcome.