feat(selfupdate): backend-pushed in-process self-update with auto-rollback #55
Open
ysyneu wants to merge 15 commits into
… deadline rollback, single-flight
HIGH: commit now requires a successful welcome handshake, not a bare TCP
connect — Connect returns an error (and drops the socket) when readWelcomeMessage
fails instead of swallowing it, so a broken new binary can no longer "commit" and
delete its .bak rollback target. onConnected fires only on the success path.
HIGH: add a probation deadline (120s) driving failure-mode-② rollback ("boots
but never completes the WS handshake"). Independent of maxAttempts — under the
supported maxAttempts=0 (unlimited reconnect) the process never exits, so the
boot-attempt counter could never trigger this rollback. New ProbationMgr.RollbackNow.
MED: single-flight upgrades via Handler.upgrading (atomic CAS) — a concurrent/
re-pushed upgrade is rejected so two Apply() calls can't race the swap and clobber
.bak; the same flag freezes new task intake so no live task is vaporized by
syscall.Exec mid-swap (rejected with a retriable result).
MED: install.sh now force-migrates a legacy /usr/local/bin layout to the writable
$STATE_DIR/bin layout even when the version already matches — otherwise an
already-current legacy runner is stranded and can never self-update.
LOW: eliminate the post-rename "no runnable binary" window — swap is now copy
current->.bak then a SINGLE atomic rename(new->current), so the canonical path is
never missing. Remove dead Marker.Committed/Error fields + the dead CheckOnBoot
branch.
Tests: RollbackNow (restore/no-op/report-rolled-back) and Apply-checksum-mismatch
(no swap, no .bak, no marker, phase=failed). go build/vet, gofumpt, golangci-lint
(0 issues) and go test ./... all green.
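A minimal Go sketch of the single-flight guard described in the MED item above. Only the `upgrading` flag and the use of an atomic CAS come from the commit message; `TryUpgrade`, `AcceptTask`, and `ErrUpgradeInFlight` are hypothetical names used for illustration, and the real `Handler` carries far more state.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrUpgradeInFlight is a hypothetical sentinel returned to a concurrent or
// re-pushed upgrade request.
var ErrUpgradeInFlight = errors.New("upgrade already in flight")

// Handler is a stand-in for the real handler; only the flag is modelled.
type Handler struct {
	upgrading atomic.Bool // requires Go 1.19+
}

// TryUpgrade runs apply only if no other upgrade is in flight.
// The CAS from false to true succeeds for exactly one caller.
func (h *Handler) TryUpgrade(apply func() error) error {
	if !h.upgrading.CompareAndSwap(false, true) {
		return ErrUpgradeInFlight
	}
	// Assumption: the flag is released when apply returns. In the real flow a
	// successful swap ends in a re-exec, so release matters on failure paths.
	defer h.upgrading.Store(false)
	return apply()
}

// AcceptTask reports whether new task intake is open. While an upgrade is in
// flight the caller would answer with a retriable rejection instead.
func (h *Handler) AcceptTask() bool {
	return !h.upgrading.Load()
}

func main() {
	h := &Handler{}
	err := h.TryUpgrade(func() error {
		// A second upgrade arriving mid-swap is rejected.
		fmt.Println("nested upgrade:", h.TryUpgrade(func() error { return nil }))
		fmt.Println("accepting tasks during swap:", h.AcceptTask())
		return nil
	})
	fmt.Println("outer upgrade:", err)
	fmt.Println("accepting tasks afterwards:", h.AcceptTask())
}
```

Using one flag for both purposes keeps the "no second swap" and "no new tasks" invariants from drifting apart, at the cost of rejecting tasks for the whole duration of the download as well as the swap.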
Completes the A11 pre-release guard. Previously a pre-release skipped only the 'latest' pointer but still re-published install.sh to the production CDN unconditionally — so a preview tag cut from an unmerged branch would ship an unreviewed installer to every fresh `curl install.sh | sh`. Move the install.sh refresh into the stable-only branch alongside the latest pointer so a pre-release mutates nothing production-facing; it only uploads its own versioned artifacts under releases/download/<tag>/.
Windows file modes don't carry a Unix executable bit, so the TestDownloadVerifyExtract perm check failed on windows-latest CI. The runner only ever self-updates on systemd-Linux; guard the assertion with runtime.GOOS so the (still meaningful) download/verify/extract coverage runs on every platform while the Unix-only bit check is skipped on Windows.
The one-line installer now produces self-updating runners (writable state-dir binary owned by the service user); note that, and add a callout that the manual /usr/local/bin install can't self-update (root-owned binary, non-root service user) so it won't receive backend-pushed upgrades. EN + ZH.
What
Adds backend-pushed in-process self-update with auto-rollback to the systemd-Linux BYOC runner. This is the runner half of a two-repo feature; the Flashduty (fc-safari) control plane that decides when to push an upgrade lands in a separate PR.
Why
Customer BYOC runners are long-lived systemd services on hosts we don't operate. Today, upgrading them means a human re-running `install.sh` on every host. This lets the backend push a target version over the existing WebSocket control channel; the runner downloads, verifies, atomically swaps its own binary, re-execs, and auto-rolls-back if the new binary can't re-establish a healthy session within a probation window.
Mechanism
- `protocol/messages.go`: two new message types over the existing channel:
  - `upgrade` (Flashduty → Runner): `{target_version, download_url, sha256}`
  - `upgrade.status` (Runner → Flashduty): `{target_version, phase, error}` with 8 phases (`received`/`deferred`/`downloading`/`verifying`/`swapping`/`committed`/`rolled_back`/`failed`).
- `selfupdate/update.go`: streams the `.tar.gz`, checks SHA-256 before touching disk, and extracts the `flashduty-runner` entry to `current.new`. The swap copies the current binary to `current.bak`, then does a single `rename(new → current)`. Rename is ETXTBSY-safe (it swaps the dir entry, not the open text segment). A pre-swap failure (bad checksum, short download) never touches the running binary; this is covered by `TestApplyChecksumMismatchLeavesCurrentUntouched`.
- Re-exec (`syscall.Exec`) into the freshly swapped binary, preserving argv/env. systemd sees the same PID.
- `selfupdate/probation.go`, `cmd/main.go`:
  - A marker (`Marker{TargetVersion, Attempts, RolledBack}`) survives re-exec and systemd restarts.
  - A boot-attempt counter (`Attempts`) is kept; at `MaxAttempts` (3) the runner restores `current.bak` and re-execs the old binary.
  - A `probationDeadline` (120s) timer fires `RollbackNow()` if the new binary never completes a WebSocket handshake.
  - Commit (`.bak` removed) happens only after a successful handshake, not merely after TCP connect. This was an adversarial-review fix (see below).
- `ws/handler.go`: an `upgrading atomic.Bool` single-flights the upgrade (`CompareAndSwap`) and freezes acceptance of new task requests while a swap is in flight; it is released on a terminal phase.
- `install.sh`: under `ProtectSystem=strict` the non-privileged `flashduty` user can't write `/usr/local/bin`. The real binary now lives at `$STATE_DIR/bin/flashduty-runner` (a `ReadWritePaths` dir) with a `/usr/local/bin` symlink for humans. `do_install` force-migrates a legacy layout (real binary in `INSTALL_DIR`) even when the version already matches.
- `.github/workflows/goreleaser.yml`: pre-release tags (`-rc`/`-beta`/`-alpha`/`-pre`) publish artifacts but skip moving the mirror `latest` pointer, so a preview build can be cut without pulling production runners onto it.
Adversarial review + fixes
A skeptical review pass found the following issues, which I fixed:
- `readWelcomeMessage` failure closes the conn and returns an error; `onConnected`/commit fire only after a successful handshake.
- `MaxAttempts=0` meant unlimited attempts (never rolls back). Now bounded.
- Deadline-driven rollback (`RollbackNow`) added for the hung-but-not-crashing case; legacy-layout migration on matching version; checksum mismatch leaves current untouched.
Verification
- `go build ./...` → exit 0
- `go vet ./...` → exit 0
- `golangci-lint` → 0 issues
- `go test ./...` → exit 0 (all packages pass; `selfupdate` covers swap, marker persistence, both rollback failure modes, commit, and the no-marker noop guard)
- E2E emulating `systemd Restart=always` against a minimal WS server speaking the locked wire protocol. Both scenarios passed: full self-update (downloading → … → committed) and seeded-marker crash-loop rollback (… → rolled_back).
Honest caveats: E2E ran on darwin with a shell while-loop standing in for systemd (the swap/re-exec/marker logic is OS-agnostic; the Linux-only pieces are `ProtectSystem`/`ReadWritePaths`, exercised by `install.sh` review, not at runtime). The crash-loop rollback path is covered by unit tests plus a seeded-marker E2E rather than an organically crashing binary.
Not included (intentionally)