feat(selfupdate): backend-pushed in-process self-update with auto-rollback#55

Open: ysyneu wants to merge 15 commits into main from feat/auto-upgrade


@ysyneu ysyneu commented Jun 6, 2026

What

Adds backend-pushed in-process self-update with auto-rollback to the systemd-Linux BYOC runner. This is the runner half of a two-repo feature; the Flashduty (fc-safari) control plane that decides when to push an upgrade lands in a separate PR.

Why

Customer BYOC runners are long-lived systemd services on hosts we don't operate. Today, upgrading them means a human re-running install.sh on every host. This change lets the backend push a target version over the existing WebSocket control channel; the runner downloads, verifies, atomically swaps its own binary, re-execs, and automatically rolls back if the new binary can't re-establish a healthy session within a probation window.

Mechanism

  1. Wire protocol (protocol/messages.go) — two new message types over the existing channel:
    • upgrade (Flashduty → Runner): {target_version, download_url, sha256}
    • upgrade.status (Runner → Flashduty): {target_version, phase, error} with 8 phases (received/deferred/downloading/verifying/swapping/committed/rolled_back/failed).
  2. Download + verify + extract (selfupdate/update.go) — streams the .tar.gz, checks SHA-256 before touching disk, extracts the flashduty-runner entry to current.new.
  3. Atomic swap — copy running binary to current.bak, then a single rename(new → current). Rename is ETXTBSY-safe (swaps the dir entry, not the open text segment). A pre-swap failure (bad checksum, short download) never touches the running binary — proven by TestApplyChecksumMismatchLeavesCurrentUntouched.
  4. Re-exec (syscall.Exec) into the freshly-swapped binary, preserving argv/env. systemd sees the same PID.
  5. Probation + auto-rollback (selfupdate/probation.go, cmd/main.go):
    • A persisted marker (Marker{TargetVersion, Attempts, RolledBack}) survives re-exec and systemd restarts.
    • Failure mode ① — crash loop: each boot under an unconfirmed marker increments Attempts; at MaxAttempts (3) the runner restores current.bak and re-execs the old binary.
    • Failure mode ② — hung/unhealthy new binary: a probationDeadline (120s) timer fires RollbackNow() if the new binary never completes a WebSocket handshake.
    • Commit: the marker is cleared (and .bak removed) only after a successful handshake — not merely after TCP connect. This was an adversarial-review fix (see below).
  6. Idle gating (ws/handler.go) — an upgrading atomic.Bool single-flights the upgrade (CompareAndSwap) and freezes acceptance of new task requests while a swap is in flight; released on terminal phase.
  7. Writable-dir relocation (install.sh) — under ProtectSystem=strict the non-privileged flashduty user can't write /usr/local/bin. The real binary now lives at $STATE_DIR/bin/flashduty-runner (a ReadWritePaths dir) with a /usr/local/bin symlink for humans. do_install force-migrates a legacy layout (real binary in INSTALL_DIR) even when the version already matches.
  8. Release safety (.github/workflows/goreleaser.yml) — pre-release tags (-rc/-beta/-alpha/-pre) publish artifacts but skip moving the mirror latest pointer, so a preview build can be cut without pulling production runners onto it.
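
The two message types from item 1 could be modelled roughly as follows. This is a minimal sketch, assuming the payloads are JSON-encoded on the WebSocket channel; the Go type names (`Upgrade`, `UpgradeStatus`, `Phase`), the helper functions, and the `omitempty` on `error` are illustrative assumptions and are not taken from protocol/messages.go. Only the field names and the eight phase values come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Phase is the lifecycle stage reported in an upgrade.status message.
type Phase string

// The eight phases listed in the PR description.
const (
	PhaseReceived    Phase = "received"
	PhaseDeferred    Phase = "deferred"
	PhaseDownloading Phase = "downloading"
	PhaseVerifying   Phase = "verifying"
	PhaseSwapping    Phase = "swapping"
	PhaseCommitted   Phase = "committed"
	PhaseRolledBack  Phase = "rolled_back"
	PhaseFailed      Phase = "failed"
)

// Upgrade is the Flashduty -> Runner payload (type name is hypothetical).
type Upgrade struct {
	TargetVersion string `json:"target_version"`
	DownloadURL   string `json:"download_url"`
	SHA256        string `json:"sha256"`
}

// UpgradeStatus is the Runner -> Flashduty payload (type name is hypothetical).
type UpgradeStatus struct {
	TargetVersion string `json:"target_version"`
	Phase         Phase  `json:"phase"`
	Error         string `json:"error,omitempty"` // omitempty is an assumption
}

// decodeUpgrade parses an incoming upgrade payload.
func decodeUpgrade(data []byte) (Upgrade, error) {
	var u Upgrade
	err := json.Unmarshal(data, &u)
	return u, err
}

// encodeStatus serialises a status report for the control plane.
func encodeStatus(s UpgradeStatus) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func main() {
	out, err := encodeStatus(UpgradeStatus{TargetVersion: "v1.2.3", Phase: PhaseDownloading})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```

Typing the phase as a named string keeps the set of allowed values in one place while leaving the wire representation a plain string.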

Adversarial review + fixes

A skeptical review pass found the following issues, all of which I fixed:

  • HIGH: commit fired on TCP connect, not on handshake, so a binary that connects but fails the welcome handshake would be wrongly committed and never rolled back. Now a readWelcomeMessage failure closes the conn and returns an error; onConnected/commit fire only after a successful handshake.
  • HIGH: MaxAttempts=0 meant unlimited attempts (never rolls back). Now bounded.
  • MED: deadline rollback (RollbackNow) added for the hung-but-not-crashing case; legacy-layout migration on matching version; checksum mismatch leaves current untouched.
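
The bounded crash-loop counter can be illustrated as below. This is a sketch under stated assumptions, not the code in selfupdate/probation.go: the `checkOnBoot` signature, the `BootAction` type, and in particular the choice to treat a non-positive limit as the default of 3 are hypothetical. The PR only states that the limit is "now bounded", not how.

```go
package main

import "fmt"

// defaultMaxAttempts mirrors the MaxAttempts (3) mentioned in the description.
const defaultMaxAttempts = 3

// Marker uses the field names from the description; persistence is omitted.
type Marker struct {
	TargetVersion string
	Attempts      int
	RolledBack    bool
}

// BootAction is a hypothetical result type for the boot-time check.
type BootAction string

const (
	NoMarker          BootAction = "no-marker"
	ContinueProbation BootAction = "continue-probation"
	Rollback          BootAction = "rollback"
)

// checkOnBoot counts one more boot under an unconfirmed marker and decides
// whether the runner should keep trying the new binary or roll back.
func checkOnBoot(m *Marker, maxAttempts int) BootAction {
	if m == nil {
		return NoMarker // no upgrade in flight: nothing to do
	}
	if maxAttempts <= 0 {
		// Assumption: a non-positive limit falls back to the default so that
		// it can never mean "retry forever".
		maxAttempts = defaultMaxAttempts
	}
	m.Attempts++
	if m.Attempts >= maxAttempts {
		m.RolledBack = true
		return Rollback
	}
	return ContinueProbation
}

func main() {
	m := &Marker{TargetVersion: "v1.2.3"}
	for boot := 1; boot <= 3; boot++ {
		fmt.Printf("boot %d: %s\n", boot, checkOnBoot(m, 0))
	}
}
```

In this sketch the third boot under an unconfirmed marker triggers the rollback even when the caller passes 0.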

Verification

  • go build ./... → exit 0
  • go vet ./... → exit 0
  • golangci-lint → 0 issues
  • go test ./... → exit 0 (all packages pass; selfupdate covers swap, marker persistence, both rollback failure modes, commit, and the no-marker noop guard)
  • E2E (driver in a scratch dir, not committed): built real vA/vB darwin binaries from this branch, packaged tar.gz, ran a while-loop supervisor emulating systemd Restart=always against a minimal WS server speaking the locked wire protocol. Both scenarios passed — full self-update (downloading → … → committed) and seeded-marker crash-loop rollback (… → rolled_back).

Honest caveats: E2E ran on darwin with a shell while-loop standing in for systemd (the swap/re-exec/marker logic is OS-agnostic; the Linux-only pieces are ProtectSystem/ReadWritePaths, exercised by install.sh review rather than at runtime). The crash-loop rollback path is covered by unit tests plus a seeded-marker E2E rather than by an organically crashing binary.

Not included (intentionally)

  • The fc-safari control-plane half (decides when to push, canary gating, status ingestion) — separate PR.
  • Docker / macOS runner upgrade — out of scope by design (systemd-Linux BYOC only).

ysyneu added 15 commits June 7, 2026 00:54
… deadline rollback, single-flight

HIGH: commit now requires a successful welcome handshake, not a bare TCP
connect — Connect returns an error (and drops the socket) when readWelcomeMessage
fails instead of swallowing it, so a broken new binary can no longer "commit" and
delete its .bak rollback target. onConnected fires only on the success path.

HIGH: add a probation deadline (120s) driving failure-mode-② rollback ("boots
but never completes the WS handshake"). Independent of maxAttempts — under the
supported maxAttempts=0 (unlimited reconnect) the process never exits, so the
boot-attempt counter could never trigger this rollback. New ProbationMgr.RollbackNow.
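
The deadline-driven rollback described above can be sketched as a race between a handshake signal and a timer. The function `watchProbation`, its channel-based signalling, and the `demo` helper are assumptions made for illustration; the real ProbationMgr API is not shown in this PR text, and the 120s deadline is shortened here so the example finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// watchProbation blocks until either the handshake succeeds or the deadline
// passes. On timeout it invokes rollbackNow and reports false.
func watchProbation(handshakeOK <-chan struct{}, deadline time.Duration, rollbackNow func()) bool {
	timer := time.NewTimer(deadline)
	defer timer.Stop()
	select {
	case <-handshakeOK:
		return true // safe to commit: clear the marker and remove .bak
	case <-timer.C:
		rollbackNow() // new binary is alive but never became healthy
		return false
	}
}

// demo runs one probation window with a short deadline.
func demo(handshake bool) (bool, bool) {
	ok := make(chan struct{})
	if handshake {
		close(ok) // simulate a completed welcome handshake
	}
	rolledBack := false
	committed := watchProbation(ok, 50*time.Millisecond, func() { rolledBack = true })
	return committed, rolledBack
}

func main() {
	committed, rolledBack := demo(true)
	fmt.Println("healthy binary: committed =", committed, "rolledBack =", rolledBack)
	committed, rolledBack = demo(false)
	fmt.Println("hung binary:    committed =", committed, "rolledBack =", rolledBack)
}
```

Because the timer runs inside the live process, this path does not depend on the process exiting, which is why it covers the case the boot-attempt counter cannot.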

MED: single-flight upgrades via Handler.upgrading (atomic CAS) — a concurrent/
re-pushed upgrade is rejected so two Apply() calls can't race the swap and clobber
.bak; the same flag freezes new task intake so no live task is vaporized by
syscall.Exec mid-swap (rejected with a retriable result).
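
A minimal sketch of the single-flight gate, assuming Go 1.19+ for `atomic.Bool`. The method names `BeginUpgrade`, `EndUpgrade`, and `AcceptTask` are hypothetical; only the `upgrading` flag and the use of compare-and-swap come from the commit message.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Handler holds the upgrade gate. Only the upgrading field is from the PR.
type Handler struct {
	upgrading atomic.Bool
}

// BeginUpgrade returns true for exactly one caller until EndUpgrade runs.
func (h *Handler) BeginUpgrade() bool {
	return h.upgrading.CompareAndSwap(false, true)
}

// EndUpgrade releases the gate once a terminal phase has been reported.
func (h *Handler) EndUpgrade() {
	h.upgrading.Store(false)
}

// AcceptTask reports whether new task requests may be taken right now.
func (h *Handler) AcceptTask() bool {
	return !h.upgrading.Load()
}

func main() {
	h := &Handler{}
	fmt.Println("first upgrade admitted: ", h.BeginUpgrade())
	fmt.Println("second upgrade admitted:", h.BeginUpgrade())
	fmt.Println("tasks accepted mid-swap:", h.AcceptTask())
	h.EndUpgrade()
	fmt.Println("tasks accepted after:   ", h.AcceptTask())
}
```

Using one flag for both purposes means there is no window in which an upgrade is running but task intake is still open.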

MED: install.sh now force-migrates a legacy /usr/local/bin layout to the writable
$STATE_DIR/bin layout even when the version already matches — otherwise an
already-current legacy runner is stranded and can never self-update.

LOW: eliminate the post-rename "no runnable binary" window — swap is now copy
current->.bak then a SINGLE atomic rename(new->current), so the canonical path is
never missing. Remove dead Marker.Committed/Error fields + the dead CheckOnBoot
branch.
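
The copy-then-rename swap can be sketched as follows. This is an illustration under assumptions, not the code in selfupdate/update.go: the helper names `copyFile`, `swap`, and `demo` are hypothetical, and the demo works on scratch files in a temp directory rather than on a running binary.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyFile copies src to dst and flushes it to disk before returning.
func copyFile(src, dst string, mode os.FileMode) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, mode)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	if err := out.Sync(); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// swap backs up current to current.bak, then replaces current with newPath in
// a single rename, so the canonical path always names a complete binary.
func swap(current, newPath string) error {
	if err := copyFile(current, current+".bak", 0o755); err != nil {
		return fmt.Errorf("backup: %w", err)
	}
	if err := os.Rename(newPath, current); err != nil {
		return fmt.Errorf("rename: %w", err)
	}
	return nil
}

// demo swaps two scratch files and returns the contents of current and .bak.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "swap-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	current := filepath.Join(dir, "current")
	if err := os.WriteFile(current, []byte("old"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(current+".new", []byte("new"), 0o755); err != nil {
		return "", "", err
	}
	if err := swap(current, current+".new"); err != nil {
		return "", "", err
	}
	cur, err := os.ReadFile(current)
	if err != nil {
		return "", "", err
	}
	bak, err := os.ReadFile(current + ".bak")
	if err != nil {
		return "", "", err
	}
	return string(cur), string(bak), nil
}

func main() {
	cur, bak, err := demo()
	fmt.Println(cur, bak, err)
}
```

Copying (rather than renaming) the old binary to .bak is what keeps the canonical path populated throughout; the only mutation of that path is the final rename.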

Tests: RollbackNow (restore/no-op/report-rolled-back) and Apply-checksum-mismatch
(no swap, no .bak, no marker, phase=failed). go build/vet, gofumpt, golangci-lint
(0 issues) and go test ./... all green.

Completes the A11 pre-release guard. Previously a pre-release skipped only
the 'latest' pointer but still re-published install.sh to the production CDN
unconditionally — so a preview tag cut from an unmerged branch would ship an
unreviewed installer to every fresh `curl install.sh | sh`. Move the
install.sh refresh into the stable-only branch alongside the latest pointer so
a pre-release mutates nothing production-facing; it only uploads its own
versioned artifacts under releases/download/<tag>/.

Windows file modes don't carry a Unix executable bit, so the
TestDownloadVerifyExtract perm check failed on windows-latest CI. The runner
only ever self-updates on systemd-Linux; guard the assertion with
runtime.GOOS so the (still meaningful) download/verify/extract coverage runs
on every platform while the Unix-only bit check is skipped on Windows.

The one-line installer now produces self-updating runners (writable state-dir
binary owned by the service user); note that, and add a callout that the manual
/usr/local/bin install can't self-update (root-owned binary, non-root service
user) so it won't receive backend-pushed upgrades. EN + ZH.