Skip to content

feat(nomad): add delegated-run provider#677

Open
coygeek wants to merge 25 commits into
openclaw:mainfrom
coygeek:nomad-provider-execute-plans
Open

feat(nomad): add delegated-run provider#677
coygeek wants to merge 25 commits into
openclaw:mainfrom
coygeek:nomad-provider-execute-plans

Conversation

@coygeek

@coygeek coygeek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add the Nomad delegated-run provider with config, doctor, job lifecycle, allocation exec run support, archive sync, list/status/stop/cleanup, and provider registration.
  • Add Nomad provider docs, command docs, metadata/matrix updates, and live smoke dispatcher coverage.
  • Fix closeout review findings for unclaimed job cleanup, missing-claim cleanup, unsupported run-session handles, unix socket addresses, and untrusted repository config credential redirection.
  • Include a focused Apple VZ readiness-timeout fix found while stabilizing the full verification suite.

Closes #380.

Verification

  • go test ./internal/providers/nomad -run 'Run|Unix|Config|WarmupCleans|CleanupRemoves'
  • go test ./internal/providers/nomad ./internal/providers/all ./internal/cli
  • go test ./...
  • go vet ./...
  • node scripts/generate-provider-matrix.mjs --check
  • bash scripts/check-docs.sh
  • bash -n scripts/live-nomad-smoke.sh
  • node --test scripts/live-nomad-smoke.test.js
  • node --test scripts/live-smoke.test.js
  • go build -trimpath -o bin/crabbox ./cmd/crabbox
  • bin/crabbox providers --json
  • bin/crabbox providers filters --json
  • scripts/live-nomad-smoke.sh classified non-live run as environment_blocked with reason CRABBOX_LIVE_not_enabled.

Review Notes

  • Codex autoreview found actionable Nomad lifecycle/run-session/unix-address/config-trust issues; all were fixed in follow-up commits.
  • Final Codex autoreview rerun was quota-blocked after the fix. OpenCode fallback review was auth-blocked, so the PR includes green automated verification plus the fixed earlier review findings, but not a clean final AI-review rerun.

coygeek added 13 commits June 24, 2026 15:14
Register Nomad as a delegated-run Linux provider and add the safe config, flag, client, and doctor foundation needed by later lifecycle work.

Keep ACL token resolution env-only through NOMAD_TOKEN or tokenEnv, expose redacted config/readiness through existing CLI surfaces, and leave run lifecycle methods explicitly unsupported for this Wave 1 branch.
Implement Nomad jobspec rendering, Crabbox ownership metadata, local claim labels, allocation readiness, retained warmup, list/status views, safe stop, and owned cleanup with dry-run support.

Keep run execution and archive sync deferred to the next Nomad wave while advertising only the cleanup feature now that CleanupBackend is implemented and tested.
Implement Nomad delegated run execution through allocation exec, including archive sync over exec stdin, command stdout/stderr streaming, exit-code propagation, timing JSON, and one-shot cleanup for non-kept runs.

Advertise archive-sync only after the exec-backed sync path is covered by deterministic Nomad/provider/CLI tests.
Document the implemented Nomad delegated-run provider surface, add provider matrix metadata/generated output, and wire an opt-in live smoke wrapper with offline gating coverage.
Avoid nondeterministic start failure classification when the helper process exit and readiness timeout are both ready in the start loop. Prefer the completed helper exit so diagnostics report the underlying process failure instead of a misleading timeout.
Deregister Nomad jobs when warmup or run setup fails before a lease claim is written, and let cleanup remove missing-job claims regardless of expiry. This avoids orphaned jobs and stale local claims after failed setup paths.
Nomad delegated runs do not advertise run-session, so returned RunResult values must not include session handles. Keep cleanup behavior through owned job deregistration and retain rerun guidance through stderr instead.
Allow absolute unix socket Nomad API URLs while rejecting relative or authority-bearing unix addresses. Cover accepted and rejected forms in the Nomad client/config tests.
Prevent repository config from redirecting Nomad env-token credentials by trust-gating connection fields and recording destination provenance. Also attach request contexts to Nomad API calls that support query/write options so readiness and cleanup timeouts can cancel stalled requests.
@socket-security

socket-security Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedgolang/​github.com/​hashicorp/​nomad/​api@​v0.0.0-20260624190804-2007237ae08c9010010010070

View full report

@socket-security

socket-security Bot commented Jun 25, 2026

Copy link
Copy Markdown

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action Severity Alert  (click "▶" to expand/collapse)
Warn Medium
Deprecated by its maintainer: golang github.com/mitchellh/go-homedir

Reason: Repository has been archived by the owner.

From: ?golang/github.com/hashicorp/nomad/api@v0.0.0-20260624190804-2007237ae08cgolang/github.com/mitchellh/go-homedir@v1.1.0

ℹ Read more on: This package | This alert | What is a deprecated package?

Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at support@socket.dev.

Suggestion: Research the state of the package and determine if there are non-deprecated versions that can be used, or if it should be replaced with a new, supported solution.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore golang/github.com/mitchellh/go-homedir@v1.1.0. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.

View full report

@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed June 25, 2026, 12:24 AM ET / 04:24 UTC.

Summary
Adds a built-in Nomad delegated-run provider with config, doctor, job lifecycle, allocation exec/archive sync, docs/tests/live-smoke wiring, plus a focused Apple VZ helper readiness-timeout fix.

Reproducibility: not applicable. this is a new provider PR rather than a broken existing behavior report. The branch has offline tests and smoke wiring, but no high-confidence live Nomad proof was provided.

Review metrics: 3 noteworthy metrics.

  • Provider diff size: 37 files changed, +5414/-52. This is a full built-in provider addition, so live proof and product acceptance matter more than for a small patch.
  • Go dependency surface: 1 direct module added; go.sum changed. The new Nomad SDK dependency is the main supply-chain surface maintainers must evaluate before merge.
  • Existing provider runtime touch: 1 Apple VZ helper file changed. The branch is mostly Nomad work but also changes an existing provider helper path that maintainers should account for.

Root-cause cluster
Relationship: fixed_by_candidate
Canonical: #380
Summary: This PR is the candidate implementation for the open Nomad delegated-run provider issue it closes; no other same-repository Nomad canonical item was found.

Members:

Proposal only: this assessment does not dispatch repair, suppress jobs, mutate sibling items, close, or merge anything.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Add redacted live Nomad proof covering doctor, warmup/run, list/status, stop, cleanup, and archive sync against a real cluster.
  • Get maintainer acceptance of the Nomad SDK dependency path and initial provider contract.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR has test commands and an environment-blocked smoke result, but no redacted live Nomad terminal output, logs, recording, or artifact showing the changed behavior after the fix; contributors should redact IPs, API keys, phone numbers, non-public endpoints, and other private details, then update the PR body to trigger a fresh review or ask a maintainer to comment @clawsweeper re-review.

Risk before merge

  • [P1] The PR still lacks redacted live Nomad proof for doctor, warmup/run, list/status, stop, cleanup, and archive-sync behavior against a real cluster; unit tests and an environment-blocked smoke run do not prove the live transport path.
  • [P1] Merging this creates a built-in provider contract around Nomad config, ACL expectations, JSON jobspec templates, cleanup semantics, and the official Nomad SDK dependency before maintainers have explicitly accepted that surface.
  • [P1] Socket flagged the new Nomad SDK path through an archived transitive package, so maintainers need to accept that supply-chain risk or request a narrower client approach.

Maintainer options:

  1. Require live proof and dependency sign-off (recommended)
    Before merge, ask for redacted live Nomad terminal output or logs plus explicit maintainer acceptance of the Nomad SDK dependency and provider contract.
  2. Let a maintainer supply validation
    A maintainer with a real Nomad cluster can choose to provide the missing proof and accept the dependency risk themselves if contributor proof is not feasible.
  3. Pause until a cluster is available
    If nobody can validate a real Nomad path, pause this landing branch and keep the linked feature issue as the product discussion.

Next step before merge

  • [P1] Human review is needed because automation cannot supply the missing live Nomad proof or accept the SDK/provider contract on maintainers' behalf.

Security
Needs attention: No concrete credential leak was found, but the new Nomad SDK dependency path and live credentialed provider surface need maintainer security/supply-chain acceptance.

Review details

Best possible solution:

Land the Nomad provider only after redacted live-cluster proof and explicit maintainer acceptance of the SDK, ACL, jobspec, cleanup, and dependency contract.

Do we have a high-confidence way to reproduce the issue?

Not applicable; this is a new provider PR rather than a broken existing behavior report. The branch has offline tests and smoke wiring, but no high-confidence live Nomad proof was provided.

Is this the best way to solve the issue?

Unclear until the remaining merge gates are satisfied. The delegated-run shape matches the linked request, but live cluster proof and maintainer acceptance of the dependency/provider contract are still required before this is the best landing path.

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against 0ec69d642764.

Label changes

Label justifications:

  • P2: This is a normal-priority provider feature with meaningful user value but no current production regression.
  • merge-risk: 🚨 other: Merging commits Crabbox to a new built-in Nomad provider, SDK dependency, ACL/jobspec contract, and live cleanup behavior that CI alone does not settle.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR has test commands and an environment-blocked smoke result, but no redacted live Nomad terminal output, logs, recording, or artifact showing the changed behavior after the fix; contributors should redact IPs, API keys, phone numbers, non-public endpoints, and other private details, then update the PR body to trigger a fresh review or ask a maintainer to comment @clawsweeper re-review.
Evidence reviewed

Security concerns:

  • [medium] Accept the Nomad SDK dependency path — go.mod:31
    The PR adds github.com/hashicorp/nomad/api, and Socket flagged its dependency path through archived github.com/mitchellh/go-homedir; maintainers should explicitly accept that risk or request a narrower client.
    Confidence: 0.86

What I checked:

  • Repository policy applied: AGENTS.md was read fully and applied to the provider-neutral architecture and secret-handling checks for this provider addition. (AGENTS.md:1, 0ec69d642764)
  • Current main lacks Nomad provider: A current-main search for Nomad/HashiCorp/NOMAD returned no source or docs hits, so the PR is not redundant with existing main behavior. (0ec69d642764)
  • Canonical linked work: The PR closes the still-open Nomad delegated-run provider request, and GitHub search found this PR as the matching implementation candidate.
  • Provider shape: The branch registers Nomad as a delegated-run Linux provider with archive-sync and cleanup features, matching the requested provider family. (internal/providers/nomad/provider.go:21, 32349afae253)
  • Credential handling: The branch keeps the Nomad token out of persisted config and argv, reads it from the selected environment variable, and rejects repository-configured Nomad credential destinations when inherited credentials are present. (internal/cli/credential_provenance.go:288, 32349afae253)
  • Lifecycle and ownership checks: The run lifecycle creates or reuses a Nomad job, validates local/remote ownership metadata, performs delegated sync or workdir setup, and owns cleanup for one-shot jobs. (internal/providers/nomad/lifecycle.go:105, 32349afae253)

Likely related people:

  • coygeek: Current-main blame shows Coy Geek authored the provider backend/delegated archive-sync foundation used by this PR, and recent provider history includes Coder and provider matrix work. (role: recent adjacent delegated-provider contributor; confidence: high; commits: beb0dabaa97e, 1b5e52ddaf66, 0ec69d642764; files: internal/cli/provider_backend.go, internal/cli/delegated_archive_sync.go, internal/providers/all/all.go)
  • steipete: Peter Steinberger authored the AWS Lambda MicroVM delegated provider commit and adjacent provider documentation/table entries that are close patterns for a new delegated-run provider. (role: provider architecture and adjacent delegated-provider contributor; confidence: medium; commits: 0301236b2752; files: internal/providers/awslambdamicrovm, docs/providers/README.md)
  • Vincent Koc: Current-main git history shows Vincent Koc committed recent provider-area changes, including Coder/provider-backend-adjacent work. (role: recent provider-area committer; confidence: low; commits: beb0dabaa97e, 1b5e52ddaf66, 0ec69d642764; files: internal/providers/all/all.go, internal/cli/provider_backend.go, docs/providers/provider-metadata.json)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. merge-risk: 🚨 other 🚨 Merging this PR has meaningful risk outside the owned taxonomy. labels Jun 25, 2026
coygeek added 9 commits June 24, 2026 17:10
Reject reused Nomad leases when the requested workdir no longer matches the local ownership claim, so reruns cannot silently relabel and sync against the wrong remote directory. Also reject broad system workdirs and make the default image requirement apply only to image-backed drivers unless a custom jobspec template owns the task config.

Verification: go test ./internal/providers/nomad -run 'TestValidateConfig|TestJobspec|TestRunRejectsReusedLeaseWithDifferentWorkdir|TestRunNoSync|TestRunKeep|TestRunSync'

Verification: go test ./internal/providers/nomad ./internal/providers/all ./internal/cli
Escape trusted jobspec placeholder substitutions as JSON string fragments, prefer replacement allocations over stale terminal allocations, and honor the warmup timing JSON contract for Nomad delegated runs.

Verification: go test ./internal/providers/nomad -run 'TestJobspec|TestWarmup|TestSelectAllocation|TestValidateConfig|TestRunRejectsReusedLeaseWithDifferentWorkdir'

Verification: go test ./internal/providers/nomad ./internal/providers/all ./internal/cli
Avoid carrying the default image into non-image Nomad drivers such as raw_exec, and replace the GNU-only sleep infinity keepalive with a portable shell loop.

Verification: go test ./internal/providers/nomad -run 'TestJobspec|TestValidateConfig|TestWarmup|TestSelectAllocation|TestRunRejectsReusedLeaseWithDifferentWorkdir'

Verification: go test ./internal/providers/nomad ./internal/providers/all ./internal/cli
Drop Nomad-local helpers that no longer have call sites after the provider integration and review-fix passes. This keeps the provider surface smaller without changing runtime behavior.

Verification: rg -n 'formatScope\(|blank\(|newLeaseSlug\(|normalizeLeaseSlug\(' internal/providers/nomad -S

Verification: go test ./internal/providers/nomad
Keep repository-local Nomad YAML from changing credentialed job execution settings such as region, namespace, driver, task, placement, resources, and timeouts. This prevents an untrusted checkout from reusing inherited Nomad credentials while switching the job to a different isolation or placement boundary.
Wait briefly for a concurrent helper process exit when the readiness deadline fires so startup reports the real helper failure instead of racing into a misleading timeout. This keeps the timeout path intact while making the early-exit diagnostic deterministic.
Propagate Nomad one-shot cleanup failures after successful delegated runs instead of only printing a warning. This keeps successful non-keep runs from reporting exit 0 while a Nomad job or local claim remains active.
Return a nonzero run exit code when Nomad allocation exec fails at the transport layer. This keeps RunResult and timing JSON from reporting exitCode 0 for failed delegated runs.
Use Nomad's global region for empty region configuration and require the configured task to be running before allocation readiness reports success. This keeps local claims compatible with Nomad defaults and prevents exec attempts before the task is executable.
coygeek added 2 commits June 24, 2026 20:28
Include the configured Nomad address in the local ownership scope and remote ownership fingerprint. This prevents cleanup or reuse against another Nomad cluster from deleting claims for jobs that are still running elsewhere.
Keep the Nomad live smoke from reading repo-local config for address or token-env destinations. The script re-emits those values as explicit CLI flags while running with live credentials, so only environment variables or an explicit trusted CRABBOX_CONFIG path should influence them.
@clawsweeper clawsweeper Bot removed the merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. label Jun 25, 2026
Update the Nomad provider docs to match the current portable keepalive command and address-inclusive ownership scope used by the implementation.
@coygeek

coygeek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Addressed the two actionable ClawSweeper docs findings in 32349af:

  • Updated the default job-spec example to the portable while :; do sleep 3600; done keepalive command.
  • Updated the ownership-scope docs to include the Nomad address alongside namespace, region, and task.

I cannot add real live Nomad proof from this environment because no live Nomad address/token config is available here. The Nomad SDK dependency acceptance remains maintainer-owned.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 other 🚨 Merging this PR has meaningful risk outside the owned taxonomy. P2 Normal priority bug or improvement with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add HashiCorp Nomad as a delegated-run provider

1 participant