feat: operational troubleshooting UI — service health, task states, and error visibility #5

@bakeb7j0

Summary

Add operational troubleshooting visibility to the SwarmCD UI — surface Docker Swarm service health, task states, error messages, and replica counts so operators can diagnose deployment problems without SSH access or Docker CLI commands.

This is a follow-on to #4 (REST API for runtime stack management), which adds timestamps and health metadata. This issue goes deeper: exposing the per-service and per-task (container) state that operators need when something goes wrong.

Use Cases

UC-1: Failed deployment diagnosis

After a reconciliation cycle deploys a new image version, some tasks fail to start (bad image tag, missing config, resource constraints). The operator needs to see which tasks failed and why — currently this requires docker service ps <service> via SSH. With this feature, the UI shows it directly.

UC-2: Replica health at a glance

Operators need to quickly assess whether all services in a stack are healthy. "3/3 running" vs "1/3 running, 2 failed" should be visible without clicking into anything — a summary on the stack card itself.

UC-3: Image pull and convergence errors

When Docker Swarm can't pull an image, schedule a task, or meet placement constraints, the error appears in the task state. These are the most common operational issues and the hardest to diagnose without direct Docker access.

UC-4: Post-restart verification

After using POST /stacks/{name}/restart (from #4), operators need to verify the restart succeeded — tasks came back up, no crash loops, correct image versions. The UI should show this without requiring CLI access.

Detailed Design

New API Endpoints

GET /stacks/{name}/services (new)

Returns all services in a stack with health summary.

Response: 200 OK

{
  "stack": "blueshift",
  "services": [
    {
      "id": "abc123",
      "name": "blueshift_web",
      "image": "registry.example.com/web:v2.1.0",
      "replicas_running": 3,
      "replicas_desired": 3,
      "update_status": "completed",
      "ports": [{"published": 8080, "target": 80}],
      "last_updated": "2026-03-21T20:24:00Z"
    },
    {
      "id": "def456",
      "name": "blueshift_worker",
      "image": "registry.example.com/worker:v2.1.0",
      "replicas_running": 1,
      "replicas_desired": 2,
      "update_status": "updating",
      "ports": [],
      "last_updated": "2026-03-21T20:25:30Z"
    }
  ]
}

Implementation:

  1. dockerCli.Client().ServiceList() filtered by label com.docker.stack.namespace={name}
  2. For each service, dockerCli.Client().TaskList() to count running vs desired replicas
  3. Map Docker API types to response struct
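
Step 2 can be sketched as a pure function over already-fetched tasks. The `task` type below is a hypothetical stand-in for the Docker SDK's task type, reduced to the two fields the count needs, and `countRunning` is an illustrative name. Keeping the counting separate from the Docker calls also makes it easy to unit test without a mocked client.

```go
package main

// task is a hypothetical, trimmed-down stand-in for a Swarm task as returned
// by TaskList(); only the fields needed for replica counting are kept.
type task struct {
	ServiceID string
	State     string // observed state, i.e. Status.State in the Docker API
}

// countRunning returns, per service ID, the number of tasks whose observed
// state is "running". Services with no running tasks are absent from the map,
// so a lookup for them yields 0.
func countRunning(tasks []task) map[string]uint64 {
	counts := make(map[string]uint64)
	for _, t := range tasks {
		if t.State == "running" {
			counts[t.ServiceID]++
		}
	}
	return counts
}
```

Because the function groups by service ID, it works the same whether tasks are fetched once per service (as in step 2) or in a single call for the whole stack. The latter is only an assumption about a possible optimization, not something this design requires.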

GET /stacks/{name}/services/{service}/tasks (new)

Returns recent tasks (container instances) for a service, including failed ones.

Response: 200 OK

{
  "service": "blueshift_worker",
  "tasks": [
    {
      "id": "task-jkl012",
      "state": "running",
      "desired_state": "running",
      "node": "swarm-node-02",
      "started_at": "2026-03-21T20:25:00Z",
      "error": ""
    },
    {
      "id": "task-mno345",
      "state": "failed",
      "desired_state": "running",
      "node": "swarm-node-01",
      "started_at": "2026-03-21T20:24:50Z",
      "stopped_at": "2026-03-21T20:24:55Z",
      "error": "task: non-zero exit (137): OOMKilled",
      "exit_code": 137
    },
    {
      "id": "task-pqr678",
      "state": "rejected",
      "desired_state": "running",
      "node": "",
      "error": "no suitable node (2 nodes not available for new tasks)",
      "started_at": "2026-03-21T20:24:45Z"
    }
  ]
}

Implementation:

  1. Resolve full service name (prepend stack prefix if needed, same logic as the restart endpoint in #4)
  2. dockerCli.Client().TaskList() filtered by service ID
  3. Return last N tasks (default 20, configurable via ?limit= query param) sorted by most recent first
  4. Include both current and historical tasks (failed, rejected, shutdown) — this is critical for diagnosing issues

Docker API Data Sources

All data comes from the Docker Engine API (already available via dockerCli.Client()):

| Data needed | Docker API call | Key fields |
|---|---|---|
| Services in stack | ServiceList() with com.docker.stack.namespace label filter | Spec.Name, Spec.TaskTemplate.ContainerSpec.Image, Spec.Mode.Replicated.Replicas |
| Running replica count | TaskList() filtered by service + state=running | Count of tasks with Status.State == "running" |
| Task state and errors | TaskList() filtered by service | Status.State, Status.Err, Status.Message, Status.ContainerStatus.ExitCode |
| Update/rollback status | ServiceInspect() | UpdateStatus.State (updating, paused, completed, rollback_started, etc.) |
| Node placement | TaskList() | NodeID (resolve to hostname via NodeInspect() if needed) |

UI Changes

Stack card enhancement

Add a health summary badge to each stack card:

┌─────────────────────────────────────┐
│ Name: blueshift                     │
│ Watching: branch:main               │
│ Revision: 96f59040                  │
│ Last Changed: 2h ago                │
│ Last Deployed: 2h ago               │
│ Services: 3/3 healthy    [Details]  │  ← NEW: summary + link
│ Repo URL: https://...               │
└─────────────────────────────────────┘

The "3/3 healthy" count is derived from the services endpoint: a service is "healthy" if replicas_running == replicas_desired.
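
The health rule can be written as a small function. This sketch is in Go for consistency with the backend; the UI would apply the same rule in its own language, or the backend could precompute the summary. The type `serviceHealth` and the function `healthSummary` are hypothetical names.

```go
package main

import "fmt"

// serviceHealth holds the two counters from the services endpoint that the
// health rule depends on (hypothetical type for illustration).
type serviceHealth struct {
	Name            string
	ReplicasRunning uint64
	ReplicasDesired uint64
}

// healthSummary applies the rule "healthy if replicas_running equals
// replicas_desired" and formats the badge text, e.g. "3/3 healthy".
func healthSummary(services []serviceHealth) string {
	healthy := 0
	for _, s := range services {
		if s.ReplicasRunning == s.ReplicasDesired {
			healthy++
		}
	}
	return fmt.Sprintf("%d/%d healthy", healthy, len(services))
}
```

Under this rule a service scaled to zero replicas counts as healthy. Global-mode services have no Spec.Mode.Replicated.Replicas value, so their desired count would need another source. Neither case is addressed by this sketch.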

Stack detail view (new)

Clicking "Details" (or the stack card) opens a detail view showing:

  1. Service list — table with columns: Service Name, Image, Replicas (running/desired), Update Status
  2. Expandable rows — click a service to see its recent tasks
  3. Task list — table with columns: Task ID (short), State (with color coding), Node, Started, Error
  4. Color coding for task state:
    • Running → green
    • Failed/rejected → red
    • Shutdown/complete → gray
    • Preparing/starting → yellow/amber
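
The color rules above amount to a lookup from task state to color. The sketch below is in Go for consistency with the backend; the UI component would port it to its own language. The function name `taskStateColor` is hypothetical. The list only covers six states, so the handling of the other pre-running Swarm states (amber) and of unknown states (gray) is an assumption.

```go
package main

// taskStateColor maps a Swarm task state to the badge color described in the
// list above. States not named in the design are handled by assumption.
func taskStateColor(state string) string {
	switch state {
	case "running":
		return "green"
	case "failed", "rejected":
		return "red"
	case "shutdown", "complete":
		return "gray"
	case "preparing", "starting":
		return "amber"
	case "new", "allocated", "pending", "assigned", "accepted", "ready":
		// Assumption: other pre-running states are shown like "preparing".
		return "amber"
	default:
		// Assumption: anything unrecognized is rendered as neutral.
		return "gray"
	}
}
```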

Error highlighting

When any task is in a failed/rejected state, the stack card's health badge turns red and shows the count:

Services: 2/3 healthy (1 failing)   [Details]

Scope Boundaries

In scope

  • Service list with replica counts and health summary
  • Task list with state, errors, exit codes, and node placement
  • Stack card health badges
  • Stack detail view with expandable service/task hierarchy

Out of scope (deferred)

  • Container logs: ServiceLogs() is a streaming API with pagination concerns, which adds significant complexity for V1. Operators can use docker service logs via CLI for now.
  • Real-time updates: polling GET /stacks/{name}/services on an interval (same pattern as current /stacks polling) is sufficient. WebSocket/SSE is deferred.
  • Service-level actions from the UI: restart/scale buttons in the detail view require auth integration with the UI (deferred per the decision in #4)
  • Historical trend data: no time-series storage; just the current snapshot from the Docker API

Implementation Steps

Step 1: Backend API

  • Add GET /stacks/{name}/services endpoint (no auth — read-only)
  • Add GET /stacks/{name}/services/{service}/tasks endpoint (no auth — read-only)
  • Use dockerCli.Client() for Docker Engine API calls
  • Add response structs with json tags
  • Add unit tests with mocked Docker client

Step 2: Stack card health badge

  • Call GET /stacks/{name}/services from the UI (can poll alongside existing /stacks call, or fetch per-card)
  • Add health summary badge to StatusCard component
  • Color code: all healthy → green, some failing → red, no data yet → gray
  • Update StatusCard tests

Step 3: Stack detail view

  • Add a detail/modal view component (triggered by clicking card or "Details" link)
  • Service list table with replica counts
  • Expandable service rows showing task list
  • Task state color coding
  • Add component tests

Acceptance Criteria

Must Have

  • GET /stacks/{name}/services returns services with replica counts, image, update status
  • GET /stacks/{name}/services/{service}/tasks returns recent tasks with state, error, exit code, node
  • Stack cards show health summary badge (X/Y healthy)
  • Stack detail view shows service list with replica counts
  • Stack detail view shows task list per service with state and errors
  • Failed/rejected tasks are visually highlighted (red)
  • All new endpoints have unit tests
  • UI components have tests

Should Have

  • Node hostname resolution (show hostname instead of node ID)
  • Task list ?limit= query parameter to cap the number of tasks returned
  • Update/rollback status displayed on service rows

Won't Have (deferred)

  • Container logs (streaming API complexity)
  • Real-time updates (WebSocket/SSE)
  • Service actions from UI (restart/scale buttons)
  • Historical trend data
