feat(nomad): add delegated-run provider #677
Register Nomad as a delegated-run Linux provider and add the safe config, flag, client, and doctor foundation needed by later lifecycle work. Keep ACL token resolution env-only through NOMAD_TOKEN or tokenEnv, expose redacted config/readiness through existing CLI surfaces, and leave run lifecycle methods explicitly unsupported for this Wave 1 branch.
Implement Nomad jobspec rendering, Crabbox ownership metadata, local claim labels, allocation readiness, retained warmup, list/status views, safe stop, and owned cleanup with dry-run support. Keep run execution and archive sync deferred to the next Nomad wave while advertising only the cleanup feature now that CleanupBackend is implemented and tested.
Implement Nomad delegated run execution through allocation exec, including archive sync over exec stdin, command stdout/stderr streaming, exit-code propagation, timing JSON, and one-shot cleanup for non-kept runs. Advertise archive-sync only after the exec-backed sync path is covered by deterministic Nomad/provider/CLI tests.
Document the implemented Nomad delegated-run provider surface, add provider matrix metadata/generated output, and wire an opt-in live smoke wrapper with offline gating coverage.
Avoid nondeterministic start failure classification when the helper process exit and readiness timeout are both ready in the start loop. Prefer the completed helper exit so diagnostics report the underlying process failure instead of a misleading timeout.
Deregister Nomad jobs when warmup or run setup fails before a lease claim is written, and let cleanup remove missing-job claims regardless of expiry. This avoids orphaned jobs and stale local claims after failed setup paths.
Nomad delegated runs do not advertise run-session, so returned RunResult values must not include session handles. Keep cleanup behavior through owned job deregistration and retain rerun guidance through stderr instead.
Allow absolute unix socket Nomad API URLs while rejecting relative or authority-bearing unix addresses. Cover accepted and rejected forms in the Nomad client/config tests.
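One way to express the accepted and rejected unix address forms is sketched below. `validateUnixAddress` is a hypothetical helper, and the assumption that only the `unix://` prefix form is accepted is mine, not confirmed by the PR text.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// validateUnixAddress is a hypothetical validator, not the branch's code.
// It accepts unix:///absolute/path and rejects relative or
// authority-bearing forms.
func validateUnixAddress(raw string) error {
	if !strings.HasPrefix(raw, "unix://") {
		return fmt.Errorf("%q does not use the unix:// form", raw)
	}
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse %q: %w", raw, err)
	}
	// A relative path after unix:// parses as a host, so it is caught here.
	if u.Host != "" || u.User != nil {
		return fmt.Errorf("%q must not carry an authority", raw)
	}
	if !path.IsAbs(u.Path) {
		return fmt.Errorf("%q must use an absolute socket path", raw)
	}
	return nil
}

func main() {
	for _, addr := range []string{
		"unix:///var/run/nomad.sock",
		"unix://localhost/var/run/nomad.sock",
		"unix://nomad.sock",
		"unix:relative/nomad.sock",
	} {
		fmt.Printf("%-40s %v\n", addr, validateUnixAddress(addr))
	}
}
```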
Prevent repository config from redirecting Nomad env-token credentials by trust-gating connection fields and recording destination provenance. Also attach request contexts to Nomad API calls that support query/write options so readiness and cleanup timeouts can cancel stalled requests.
Review the following changes in direct dependencies.
Warning: Review the following alerts detected in dependencies. According to your organization's Security Policy, it is recommended to resolve "Warn" alerts.
Codex review: needs real behavior proof before merge. Reviewed June 25, 2026, 12:24 AM ET / 04:24 UTC.
Summary
Reproducibility: not applicable. This is a new provider PR rather than a broken existing behavior report. The branch has offline tests and smoke wiring, but no high-confidence live Nomad proof was provided.
Review metrics: 3 noteworthy metrics.
Root-cause cluster Members:
Proposal only: this assessment does not dispatch repair, suppress jobs, mutate sibling items, close, or merge anything.
Merge readiness
Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
Rank-up moves:
Proof guidance:
Risk before merge
Maintainer options:
Next step before merge
Security
Review details
Best possible solution: Land the Nomad provider only after redacted live-cluster proof and explicit maintainer acceptance of the SDK, ACL, jobspec, cleanup, and dependency contract.
Do we have a high-confidence way to reproduce the issue? Not applicable; this is a new provider PR rather than a broken existing behavior report. The branch has offline tests and smoke wiring, but no high-confidence live Nomad proof was provided.
Is this the best way to solve the issue? Unclear until the remaining merge gates are satisfied. The delegated-run shape matches the linked request, but live cluster proof and maintainer acceptance of the dependency/provider contract are still required before this is the best landing path.
AGENTS.md: found and applied where relevant.
Codex review notes: model internal, reasoning high; reviewed against 0ec69d642764.
Label changes
Label justifications:
Evidence reviewed
Security concerns:
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.
How this review workflow works
Reject reused Nomad leases when the requested workdir no longer matches the local ownership claim, so reruns cannot silently relabel and sync against the wrong remote directory. Also reject broad system workdirs and make the default image requirement apply only to image-backed drivers unless a custom jobspec template owns the task config.
Verification: go test ./internal/providers/nomad -run 'TestValidateConfig|TestJobspec|TestRunRejectsReusedLeaseWithDifferentWorkdir|TestRunNoSync|TestRunKeep|TestRunSync'
Verification: go test ./internal/providers/nomad ./internal/providers/all ./internal/cli
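The two workdir checks in this commit could look roughly like the sketch below. The deny list and the names `broadWorkdirs`, `validateWorkdir`, and `checkReusedLease` are assumptions for illustration; the provider's real list and function names are not shown in this conversation.

```go
package main

import (
	"fmt"
	"path"
)

// broadWorkdirs is an illustrative deny list, not the provider's actual one.
var broadWorkdirs = map[string]bool{
	"/": true, "/bin": true, "/etc": true, "/home": true,
	"/root": true, "/tmp": true, "/usr": true, "/var": true,
}

// validateWorkdir rejects relative paths and broad system directories.
// Cleaning first means "/etc/" and "/usr/.." are judged by what they resolve to.
func validateWorkdir(workdir string) error {
	if !path.IsAbs(workdir) {
		return fmt.Errorf("workdir %q must be absolute", workdir)
	}
	if clean := path.Clean(workdir); broadWorkdirs[clean] {
		return fmt.Errorf("workdir %q is too broad", clean)
	}
	return nil
}

// checkReusedLease refuses to reuse a lease whose claimed workdir differs
// from the requested one, instead of silently relabeling the claim.
func checkReusedLease(claimedWorkdir, requestedWorkdir string) error {
	if path.Clean(claimedWorkdir) != path.Clean(requestedWorkdir) {
		return fmt.Errorf("lease is claimed for %q, not %q", claimedWorkdir, requestedWorkdir)
	}
	return nil
}

func main() {
	fmt.Println(validateWorkdir("/work/crabbox"))
	fmt.Println(validateWorkdir("/usr/.."))
	fmt.Println(checkReusedLease("/work/a", "/work/b"))
}
```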
Escape trusted jobspec placeholder substitutions as JSON string fragments, prefer replacement allocations over stale terminal allocations, and honor the warmup timing JSON contract for Nomad delegated runs.
Verification: go test ./internal/providers/nomad -run 'TestJobspec|TestWarmup|TestSelectAllocation|TestValidateConfig|TestRunRejectsReusedLeaseWithDifferentWorkdir'
Verification: go test ./internal/providers/nomad ./internal/providers/all ./internal/cli
Avoid carrying the default image into non-image Nomad drivers such as raw_exec, and replace the GNU-only sleep infinity keepalive with a portable shell loop.
Verification: go test ./internal/providers/nomad -run 'TestJobspec|TestValidateConfig|TestWarmup|TestSelectAllocation|TestRunRejectsReusedLeaseWithDifferentWorkdir'
Verification: go test ./internal/providers/nomad ./internal/providers/all ./internal/cli
Drop Nomad-local helpers that no longer have call sites after the provider integration and review-fix passes. This keeps the provider surface smaller without changing runtime behavior.
Verification: rg -n 'formatScope\(|blank\(|newLeaseSlug\(|normalizeLeaseSlug\(' internal/providers/nomad -S
Verification: go test ./internal/providers/nomad
Keep repository-local Nomad YAML from changing credentialed job execution settings such as region, namespace, driver, task, placement, resources, and timeouts. This prevents an untrusted checkout from reusing inherited Nomad credentials while switching the job to a different isolation or placement boundary.
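A rough sketch of the trust gate follows. The struct, its fields, and `mergeRepoConfig` are hypothetical and cover only a subset of the settings named above; the real provider may gate and report these differently.

```go
package main

import "fmt"

// nomadSettings is a hypothetical subset of the credential-sensitive settings.
type nomadSettings struct {
	Address   string
	TokenEnv  string
	Region    string
	Namespace string
	Driver    string
}

// mergeRepoConfig overlays repository-local values onto base only when the
// repository config is trusted. For an untrusted checkout it leaves base
// untouched and reports which fields were ignored.
func mergeRepoConfig(base, repo nomadSettings, repoTrusted bool) (nomadSettings, []string) {
	var ignored []string
	fields := []struct {
		name     string
		dst, src *string
	}{
		{"address", &base.Address, &repo.Address},
		{"tokenEnv", &base.TokenEnv, &repo.TokenEnv},
		{"region", &base.Region, &repo.Region},
		{"namespace", &base.Namespace, &repo.Namespace},
		{"driver", &base.Driver, &repo.Driver},
	}
	for _, f := range fields {
		if *f.src == "" {
			continue // the repo config does not set this field
		}
		if !repoTrusted {
			ignored = append(ignored, f.name)
			continue
		}
		*f.dst = *f.src
	}
	return base, ignored
}

func main() {
	base := nomadSettings{Address: "https://nomad.internal:4646", Driver: "docker"}
	repo := nomadSettings{Address: "https://other.example:4646", Driver: "raw_exec"}
	merged, ignored := mergeRepoConfig(base, repo, false)
	fmt.Println(merged.Address, merged.Driver, ignored)
}
```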
Wait briefly for a concurrent helper process exit when the readiness deadline fires so startup reports the real helper failure instead of racing into a misleading timeout. This keeps the timeout path intact while making the early-exit diagnostic deterministic.
Propagate Nomad one-shot cleanup failures after successful delegated runs instead of only printing a warning. This keeps successful non-keep runs from reporting exit 0 while a Nomad job or local claim remains active.
Return a nonzero run exit code when Nomad allocation exec fails at the transport layer. This keeps RunResult and timing JSON from reporting exitCode 0 for failed delegated runs.
Use Nomad's global region for empty region configuration and require the configured task to be running before allocation readiness reports success. This keeps local claims compatible with Nomad defaults and prevents exec attempts before the task is executable.
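Both rules in this commit are small enough to sketch directly. The types here are local stand-ins rather than the Nomad SDK's, and the function names are hypothetical; the status strings follow Nomad's "running" client and task states.

```go
package main

import (
	"fmt"
	"strings"
)

// taskState is a local stand-in for the per-task state Nomad reports.
type taskState struct {
	State string // "pending", "running", or "dead"
}

// effectiveRegion maps an empty region to Nomad's default, "global".
func effectiveRegion(region string) string {
	if strings.TrimSpace(region) == "" {
		return "global"
	}
	return region
}

// allocationReady requires both the allocation and the configured task to be
// running, so exec is not attempted before the task can accept it.
func allocationReady(clientStatus string, tasks map[string]taskState, task string) bool {
	if clientStatus != "running" {
		return false
	}
	state, ok := tasks[task]
	return ok && state.State == "running"
}

func main() {
	tasks := map[string]taskState{"sidecar": {State: "running"}, "crabbox": {State: "pending"}}
	fmt.Println(effectiveRegion(""), allocationReady("running", tasks, "crabbox"))
}
```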
Include the configured Nomad address in the local ownership scope and remote ownership fingerprint. This prevents cleanup or reuse against another Nomad cluster from deleting claims for jobs that are still running elsewhere.
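An address-inclusive fingerprint could be derived along these lines. The field set, the SHA-256 choice, and the 16-character truncation are assumptions for illustration; the PR text does not state how the real fingerprint is computed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ownershipFingerprint is a hypothetical helper that hashes the cluster
// address together with the rest of the ownership scope.
func ownershipFingerprint(address, region, namespace, leaseID string) string {
	h := sha256.New()
	for _, part := range []string{address, region, namespace, leaseID} {
		// The length prefix keeps ("ab", "c") distinct from ("a", "bc").
		fmt.Fprintf(h, "%d:%s|", len(part), part)
	}
	return hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	a := ownershipFingerprint("https://nomad-a.internal:4646", "global", "default", "lease-1")
	b := ownershipFingerprint("https://nomad-b.internal:4646", "global", "default", "lease-1")
	fmt.Println(a != b)
}
```

With the address in the hash, a claim recorded against one cluster does not match a job with the same lease ID on another cluster.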
Keep the Nomad live smoke from reading repo-local config for address or token-env destinations. The script re-emits those values as explicit CLI flags while running with live credentials, so only environment variables or an explicit trusted CRABBOX_CONFIG path should influence them.
Update the Nomad provider docs to match the current portable keepalive command and address-inclusive ownership scope used by the implementation.
Addressed the two actionable ClawSweeper docs findings in 32349af:
I cannot add real live Nomad proof from this environment because no live Nomad address/token config is available here. The Nomad SDK dependency acceptance remains maintainer-owned.
@clawsweeper re-review
🦞🧹 I asked ClawSweeper to review this item again.
Summary
Closes #380.
Verification
go test ./internal/providers/nomad -run 'Run|Unix|Config|WarmupCleans|CleanupRemoves'
go test ./internal/providers/nomad ./internal/providers/all ./internal/cli
go test ./...
go vet ./...
node scripts/generate-provider-matrix.mjs --check
bash scripts/check-docs.sh
bash -n scripts/live-nomad-smoke.sh
node --test scripts/live-nomad-smoke.test.js
node --test scripts/live-smoke.test.js
go build -trimpath -o bin/crabbox ./cmd/crabbox
bin/crabbox providers --json
bin/crabbox providers filters --json
scripts/live-nomad-smoke.sh classified non-live run as environment_blocked with reason CRABBOX_LIVE_not_enabled.
Review Notes