Add operational troubleshooting visibility to the SwarmCD UI — surface Docker Swarm service health, task states, error messages, and replica counts so operators can diagnose deployment problems without SSH access or Docker CLI commands.
This is a follow-on to #4 (REST API for runtime stack management), which adds timestamps and health metadata. This issue goes deeper: exposing the per-service and per-task (container) state that operators need when something goes wrong.
Use Cases
UC-1: Failed deployment diagnosis
After a reconciliation cycle deploys a new image version, some tasks fail to start (bad image tag, missing config, resource constraints). The operator needs to see which tasks failed and why — currently this requires docker service ps <service> via SSH. With this feature, the UI shows it directly.
UC-2: Replica health at a glance
Operators need to quickly assess whether all services in a stack are healthy. "3/3 running" vs "1/3 running, 2 failed" should be visible without clicking into anything — a summary on the stack card itself.
UC-3: Image pull and convergence errors
When Docker Swarm can't pull an image, schedule a task, or meet placement constraints, the error appears in the task state. These are the most common operational issues and the hardest to diagnose without direct Docker access.
UC-4: Post-restart verification
After using POST /stacks/{name}/restart (from #4), operators need to verify the restart succeeded — tasks came back up, no crash loops, correct image versions. The UI should show this without requiring CLI access.
Detailed Design
New API Endpoints
GET /stacks/{name}/services (new)
Returns all services in a stack with health summary.
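GET /stacks/{name}/services/{service}/tasks (new)
Returns recent tasks (container instances) for a service, including failed ones. Each task's node comes from its NodeID (resolved to a hostname via NodeInspect() if needed).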
UI Changes
Stack card enhancement
Add a health summary badge to each stack card:
┌─────────────────────────────────────┐
│ Name: blueshift │
│ Watching: branch:main │
│ Revision: 96f59040 │
│ Last Changed: 2h ago │
│ Last Deployed: 2h ago │
│ Services: 3/3 healthy [Details] │ ← NEW: summary + link
│ Repo URL: https://... │
└─────────────────────────────────────┘
The "3/3 healthy" count is derived from the services endpoint: a service is "healthy" if replicas_running == replicas_desired.
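The health rule above is simple enough to sketch directly. The following is a minimal illustration, assuming a hypothetical `ServiceHealth` type that carries only the two replica counts; it is written in Go for consistency with the backend, although the UI would apply the same rule in its own language.

```go
package main

import "fmt"

// ServiceHealth is a hypothetical type holding the two counts the rule needs.
type ServiceHealth struct {
	Name            string
	ReplicasRunning int
	ReplicasDesired int
}

// isHealthy applies the rule from the proposal: running == desired.
func isHealthy(s ServiceHealth) bool {
	return s.ReplicasRunning == s.ReplicasDesired
}

// healthSummary counts healthy services and the total number of services.
func healthSummary(services []ServiceHealth) (healthy, total int) {
	for _, s := range services {
		if isHealthy(s) {
			healthy++
		}
	}
	return healthy, len(services)
}

// summaryLabel renders the "X/Y healthy" text shown on the stack card.
func summaryLabel(services []ServiceHealth) string {
	healthy, total := healthSummary(services)
	return fmt.Sprintf("%d/%d healthy", healthy, total)
}

func main() {
	services := []ServiceHealth{
		{Name: "blueshift_web", ReplicasRunning: 3, ReplicasDesired: 3},
		{Name: "blueshift_worker", ReplicasRunning: 1, ReplicasDesired: 2},
	}
	fmt.Println(summaryLabel(services))
}
```

Note that under this rule a service scaled to zero replicas counts as healthy (0 == 0); whether that is the desired behavior is not stated in the proposal.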
Stack detail view (new)
Clicking "Details" (or the stack card) opens a detail view showing:
Service list — table with columns: Service Name, Image, Replicas (running/desired), Update Status
Expandable rows — click a service to see its recent tasks
Task list — table with columns: Task ID (short), State (with color coding), Node, Started, Error
Color coding for task state:
Running → green
Failed/rejected → red
Shutdown/complete → gray
Preparing/starting → yellow/amber
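The color coding above amounts to a small lookup. A minimal sketch follows, written in Go for consistency with the backend; the function name `taskStateColor` is hypothetical, and the gray fallback for states the list does not mention is an assumption rather than part of the proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// taskStateColor maps a task state to the badge color listed in the proposal.
// States not named there fall back to gray, which is an assumption.
func taskStateColor(state string) string {
	switch strings.ToLower(state) {
	case "running":
		return "green"
	case "failed", "rejected":
		return "red"
	case "shutdown", "complete":
		return "gray"
	case "preparing", "starting":
		return "amber"
	default:
		return "gray" // assumed default for unlisted states
	}
}

func main() {
	for _, s := range []string{"running", "failed", "rejected", "shutdown", "preparing"} {
		fmt.Printf("%s -> %s\n", s, taskStateColor(s))
	}
}
```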
Error highlighting
When any task is in a failed/rejected state, the stack card's health badge turns red and shows the count:
Services: 2/3 healthy (1 failing) [Details]
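One way the badge text and color could be derived is sketched below. It assumes the "failing" count is the number of services that are not healthy (total minus healthy); the proposal ties the red state to failed/rejected tasks, so this is a simplification. The `Badge` type, `stackBadge` function, and the "unknown" placeholder text are hypothetical.

```go
package main

import "fmt"

// Badge is a hypothetical view model for the stack card's health badge.
type Badge struct {
	Text  string
	Color string
}

// stackBadge builds the badge from healthy/total service counts.
// hasData is false until the first services response has arrived.
func stackBadge(healthy, total int, hasData bool) Badge {
	if !hasData {
		// "no data yet" maps to gray; the placeholder text is an assumption.
		return Badge{Text: "Services: unknown", Color: "gray"}
	}
	if healthy == total {
		return Badge{Text: fmt.Sprintf("Services: %d/%d healthy", healthy, total), Color: "green"}
	}
	// Assumption: every service that is not healthy is reported as failing.
	failing := total - healthy
	return Badge{
		Text:  fmt.Sprintf("Services: %d/%d healthy (%d failing)", healthy, total, failing),
		Color: "red",
	}
}

func main() {
	fmt.Println(stackBadge(3, 3, true))
	fmt.Println(stackBadge(2, 3, true))
	fmt.Println(stackBadge(0, 0, false))
}
```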
Scope Boundaries
In scope
Service list with replica counts and health summary
Task list with state, errors, exit codes, and node placement
Stack card health badges
Stack detail view with expandable service/task hierarchy
Out of scope (deferred)
Container logs — ServiceLogs() is a streaming API with pagination concerns; significant complexity for V1. Operators can use docker service logs via CLI for now.
Real-time updates — polling GET /stacks/{name}/services on an interval (same pattern as current /stacks polling) is sufficient. WebSocket/SSE is deferred.
Service-level actions from the UI: restart/scale buttons in the detail view require auth integration with the UI (deferred per the decision in #4)
Historical trend data — no time-series storage; just current snapshot from Docker API
Implementation Steps
Step 1: Backend API
Add GET /stacks/{name}/services endpoint (no auth — read-only)
Add GET /stacks/{name}/services/{service}/tasks endpoint (no auth — read-only)
Use dockerCli.Client() for Docker Engine API calls
Add response structs with json tags
Add unit tests with mocked Docker client
Step 2: Stack card health badge
Call GET /stacks/{name}/services from the UI (can poll alongside existing /stacks call, or fetch per-card)
Add health summary badge to StatusCard component
Color code: all healthy → green, some failing → red, no data yet → gray
Update StatusCard tests
Step 3: Stack detail view
Add a detail/modal view component (triggered by clicking card or "Details" link)
Service list table with replica counts
Expandable service rows showing task list
Task state color coding
Add component tests
Acceptance Criteria
Must Have
GET /stacks/{name}/services returns services with replica counts, image, update status
GET /stacks/{name}/services/{service}/tasks returns recent tasks with state, error, exit code, node
Stack cards show health summary badge (X/Y healthy)
Stack detail view shows service list with replica counts
Stack detail view shows task list per service with state and errors
Failed/rejected tasks are visually highlighted (red)
All new endpoints have unit tests
UI components have tests
Should Have
Node hostname resolution (show hostname instead of node ID)
?limit= query parameter for pagination
New API Endpoints
GET /stacks/{name}/services (new)
Returns all services in a stack with health summary.
Response:
200 OK
```json
{
  "stack": "blueshift",
  "services": [
    {
      "id": "abc123",
      "name": "blueshift_web",
      "image": "registry.example.com/web:v2.1.0",
      "replicas_running": 3,
      "replicas_desired": 3,
      "update_status": "completed",
      "ports": [{"published": 8080, "target": 80}],
      "last_updated": "2026-03-21T20:24:00Z"
    },
    {
      "id": "def456",
      "name": "blueshift_worker",
      "image": "registry.example.com/worker:v2.1.0",
      "replicas_running": 1,
      "replicas_desired": 2,
      "update_status": "updating",
      "ports": [],
      "last_updated": "2026-03-21T20:25:30Z"
    }
  ]
}
```
Implementation:
dockerCli.Client().ServiceList() filtered by label com.docker.stack.namespace={name}
dockerCli.Client().TaskList() to count running vs desired replicas
GET /stacks/{name}/services/{service}/tasks (new)
Returns recent tasks (container instances) for a service, including failed ones.
Response:
200 OK
```json
{
  "service": "blueshift_worker",
  "tasks": [
    {
      "id": "task-jkl012",
      "state": "running",
      "desired_state": "running",
      "node": "swarm-node-02",
      "started_at": "2026-03-21T20:25:00Z",
      "error": ""
    },
    {
      "id": "task-mno345",
      "state": "failed",
      "desired_state": "running",
      "node": "swarm-node-01",
      "started_at": "2026-03-21T20:24:50Z",
      "stopped_at": "2026-03-21T20:24:55Z",
      "error": "task: non-zero exit (137): OOMKilled",
      "exit_code": 137
    },
    {
      "id": "task-pqr678",
      "state": "rejected",
      "desired_state": "running",
      "node": "",
      "error": "no suitable node (2 nodes not available for new tasks)",
      "started_at": "2026-03-21T20:24:45Z"
    }
  ]
}
```
Implementation:
dockerCli.Client().TaskList() filtered by service ID
Results limited via the ?limit= query param, sorted by most recent first
Docker API Data Sources
All data comes from the Docker Engine API (already available via dockerCli.Client()):

| Docker API call | Fields |
|---|---|
| ServiceList() with com.docker.stack.namespace label filter | Spec.Name, Spec.TaskTemplate.ContainerSpec.Image, Spec.Mode.Replicated.Replicas |
| TaskList() filtered by service + state=running | Status.State == "running" |
| TaskList() filtered by service | Status.State, Status.Err, Status.Message, Status.ContainerStatus.ExitCode |
| ServiceInspect() | UpdateStatus.State (updating, paused, completed, rollback_started, etc.) |
| TaskList() | NodeID (resolve to hostname via NodeInspect() if needed) |
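The running-versus-desired replica counts used by the services endpoint could be computed by grouping the task list per service, roughly as sketched below. The `Task` and `Replicas` types are simplified, hypothetical stand-ins for the Docker API types, with the desired count assumed to come from Spec.Mode.Replicated.Replicas and the state from Status.State.

```go
package main

import "fmt"

// Task is a simplified stand-in for a Swarm task: only the fields needed here.
type Task struct {
	ServiceID string
	State     string // assumed to hold Status.State, e.g. "running" or "failed"
}

// Replicas pairs the observed running count with the desired count.
type Replicas struct {
	Running int
	Desired int
}

// countReplicas takes the desired replica count per service ID and the stack's
// tasks, and returns running/desired per service. Tasks belonging to services
// that are not in desired are ignored.
func countReplicas(desired map[string]int, tasks []Task) map[string]Replicas {
	out := make(map[string]Replicas, len(desired))
	for id, d := range desired {
		out[id] = Replicas{Desired: d}
	}
	for _, t := range tasks {
		r, ok := out[t.ServiceID]
		if !ok || t.State != "running" {
			continue
		}
		r.Running++
		out[t.ServiceID] = r
	}
	return out
}

func main() {
	desired := map[string]int{"abc123": 3, "def456": 2}
	tasks := []Task{
		{ServiceID: "abc123", State: "running"},
		{ServiceID: "abc123", State: "running"},
		{ServiceID: "abc123", State: "running"},
		{ServiceID: "def456", State: "running"},
		{ServiceID: "def456", State: "failed"},
	}
	counts := countReplicas(desired, tasks)
	fmt.Printf("abc123: %d/%d\n", counts["abc123"].Running, counts["abc123"].Desired)
	fmt.Printf("def456: %d/%d\n", counts["def456"].Running, counts["def456"].Desired)
}
```

Counting from a single task list for the whole stack avoids one TaskList() call per service; filtering on state=running at the API level, as the data sources above suggest, would shrink the list further.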