Skip to content

feat(vast): add Vast.ai SSH lease provider#680

Open
coygeek wants to merge 22 commits into
openclaw:mainfrom
coygeek:vast-provider
Open

feat(vast): add Vast.ai SSH lease provider#680
coygeek wants to merge 22 commits into
openclaw:mainfrom
coygeek:vast-provider

Conversation

@coygeek

@coygeek coygeek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #363

  • Add the Vast.ai direct SSH-lease provider with config/env/flag handling, provider registration, guarded auth, and provider matrix/docs/live-smoke coverage.
  • Implement Vast offer search, create/show/list/manage/destroy/SSH-key API flows, ownership labels, per-lease SSH key handling, release-action handling, recovery/rollback, and cleanup safety checks.
  • Address closeout review findings around documented Vast API payload shapes, v1 paginated instance listing, retained stopped claims, status-ready probes, numeric slug resolution, Tailscale preflight rejection, unsafe custom API URL components, and race-test timing stability.

Verification

  • go test ./internal/providers/vast ./internal/providers/all ./internal/cli && go vet ./...
  • go test -race ./...
  • npm ci --prefix worker
  • npm run format:check --prefix worker && npm test --prefix worker
  • node scripts/generate-provider-matrix.mjs && node scripts/check-provider-matrix.mjs && scripts/check-docs.sh
  • bash -n scripts/live-vast-smoke.sh
  • CRABBOX_LIVE=0 CRABBOX_LIVE_PROVIDERS=vast scripts/live-vast-smoke.sh -> classification=environment_blocked reason=CRABBOX_LIVE_not_enabled
  • CRABBOX_LIVE=1 CRABBOX_LIVE_PROVIDERS=vast CRABBOX_VAST_API_KEY= VAST_API_KEY= scripts/live-vast-smoke.sh -> classification=environment_blocked reason=VAST_API_KEY_missing
  • node --test scripts/live-vast-smoke.test.js
  • git diff --check
  • ~/.agents/skills/_openclaw/autoreview/scripts/autoreview --engine codex --model gpt-5.5 --thinking high --mode branch --base origin/main -> clean, no accepted/actionable findings

coygeek added 19 commits June 24, 2026 15:06
Add the first-party Vast provider configuration surface with env-only API-key sourcing, safe config output, provider flags, built-in registration, and offline placeholder backend contracts for later lifecycle plans.

Add focused tests for Vast config precedence, credential destination provenance, provider discovery, flag registration, validation, and secret redaction.
Add the provider-local Vast.ai HTTP client, request and response models, payload builders, redacted API error handling, and compact ownership label helpers for the upcoming lifecycle implementation.

Cover the client contract with fake HTTP tests for auth, redirects, offer search, instance creation and management, SSH-key methods, response decoding, redaction, and label ownership safety.
Add the Vast SSH-lease backend for acquire, resolve, list, release, touch, doctor, and cleanup using generated per-lease keys and strict cbx1 ownership labels.

Cleanup is guarded by local claims plus provider labels, release destroys by default, and lifecycle tests use a fake Vast API/readiness path without live credentials.
Use fileURLToPath for the Cloudflare Workers mock alias so Vitest resolves the local mock correctly from repository paths that contain spaces.
Respect the generic SSH user override when Vast provider defaults are applied, so provider-specific defaults cannot replace an operator-supplied SSH user. Add regression coverage for both the Vast backend config and embedded direct SSH backend config.
Keep cleanup enabled for possible partial Vast warmup failures, but do not let a pre-acquire not-found stop response convert a known capacity blocker into cleanup_failed. Add regression coverage for the no-lease cleanup path.
Keep config show from replacing an explicitly supplied generic SSH user with the Vast provider default. This keeps the displayed effective config aligned with the backend defaulting path.
Report the release action captured on the lease target instead of the current config, so stopped or kept Vast instances are not reported as destroyed. Also route cleanup of owned-looking instances without matching local claims through the safe missing-claim refusal path.
Give local fake-SSH probe tests the same headroom as other terminal-bound checks so transient scheduling delays do not fail unrelated provider verification.
Preserve persisted Vast release and key metadata when normal resolve refreshes local claim endpoints, so stop/keep leases do not drift back to destroy. Also allow status-only resolution for owned instances that do not yet expose an SSH endpoint.
Detach provider-owned Vast SSH keys before destroying the instance so normal release cleanup follows the same effective ordering as rollback. Add an order assertion to prevent the detach call from becoming a suppressed post-destroy no-op.
Let an explicitly supplied Vast release action override the persisted claim label during release and reporting. This keeps safety overrides such as --vast-release-action keep from being ignored on leases acquired with the default destroy policy.
Give controller subprocess tests more non-race scheduling headroom after repeated local failures in terminal-bound process lifecycle checks. This keeps provider verification from failing on unrelated launch-gate timing.
Encode Vast search and create requests using the documented API shapes, restore stored SSH keys on resolve, and keep stopped lease claims so retained instances can be reconciled or destroyed later.

Also widen a race-instrumented Apple VZ helper test timeout encountered during the final verification gate.
@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed June 25, 2026, 1:11 AM ET / 05:11 UTC.

Summary
The PR adds a built-in Vast.ai Linux SSH-lease provider with config/env/flags, provider registration, REST lifecycle code, docs, metadata, tests, and a guarded live-smoke script.

Reproducibility: not applicable. this is a feature PR rather than a reproducible failure in existing behavior. The relevant proof path is a successful real Vast provider lifecycle, and the PR body/comments currently show only static checks plus environment-blocked smoke cases.

Review metrics: 1 noteworthy metric.

  • Provider surface: 32 files changed, +4,606/-12. The PR crosses config, lifecycle, docs, scripts, and tests for a new billable provider, so proof and product acceptance matter before merge.

Root-cause cluster
Relationship: fixed_by_candidate
Canonical: #363
Summary: This PR is the candidate implementation for the open Vast.ai direct SSH-lease provider request.

Members:

Proposal only: this assessment does not dispatch repair, suppress jobs, mutate sibling items, close, or merge anything.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Add redacted terminal output, logs, linked artifacts, or a recording showing successful real Vast doctor, warmup/provision, status/list, SSH run, stop/release, and cleanup.
  • Get maintainer confirmation that Vast should ship as a built-in provider rather than leaving the linked request as an open product decision.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR body and comments do not include successful after-fix live Vast lifecycle output; contributor should add redacted terminal/log output and update the PR body for re-review.

Risk before merge

  • [P1] No successful redacted live Vast lifecycle proof is present, so fake API tests and static CI do not prove billable instance creation, SSH readiness, list/status, release, and cleanup against Vast.ai.
  • [P1] Merging would add a new built-in billable provider, CLI/config surface, docs page, and live-smoke expectation while the linked provider request still needs maintainer product acceptance.
  • [P1] The diff handles API keys, provider-selected endpoints, SSH keys, and destructive cleanup; green CI cannot settle the real credential and lifecycle boundary.

Maintainer options:

  1. Require Live Vast Lifecycle Proof (recommended)
    Ask the contributor to add redacted output or logs showing doctor, warmup/provision, status/list, run over SSH, stop/release, and cleanup against a real Vast account before merge.
  2. Pause For Built-In Provider Acceptance
    If maintainers have not accepted Vast as first-party surface, keep this PR paused while Add Vast.ai as a direct GPU SSH-lease provider #363 carries that product/API decision.
  3. Accept Provider Surface Deliberately
    Maintainers may choose to own the new billable provider risk, but that should be an explicit acceptance of the credential, cost, cleanup, and support implications.

Next step before merge

  • [P1] Maintainer/product review and contributor live-provider proof are required; no narrow automated code repair is established.

Security
Cleared: No concrete static security or supply-chain defect was found in the inspected diff; the remaining security-boundary risk is live credential/lifecycle proof and maintainer acceptance.

Review details

Best possible solution:

Land only after maintainers accept Vast as a built-in provider and the PR includes redacted successful live Vast lifecycle proof; otherwise keep #363 as the product-decision home.

Do we have a high-confidence way to reproduce the issue?

Not applicable; this is a feature PR rather than a reproducible failure in existing behavior. The relevant proof path is a successful real Vast provider lifecycle, and the PR body/comments currently show only static checks plus environment-blocked smoke cases.

Is this the best way to solve the issue?

Unclear for merge as-is: the adapter shape matches Crabbox's SSH-lease contract and static review found no blocking line-level defect, but the best landing path still needs maintainer acceptance of the built-in provider surface and successful live-provider proof.

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against 0ec69d642764.

Label changes

Label justifications:

  • P2: This is a normal-priority provider expansion with meaningful but bounded blast radius.
  • merge-risk: 🚨 security-boundary: The PR adds a credentialed provider endpoint, API-key handling, SSH-key handling, and destructive provider operations.
  • merge-risk: 🚨 other: Static CI cannot prove the new billable Vast.ai provider safely creates, reaches, lists, releases, and cleans up real instances.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body and comments do not include successful after-fix live Vast lifecycle output; contributor should add redacted terminal/log output and update the PR body for re-review.
Evidence reviewed

What I checked:

  • Repository policy applied: AGENTS.md was read fully; its provider-neutral architecture boundary and credential guidance are relevant to this provider PR. (AGENTS.md:13, 0ec69d642764)
  • Current main does not implement Vast: A current-main tree search for Vast provider paths, docs, or live-smoke files returned no matches, so the PR is not obsolete on main. (0ec69d642764)
  • Latest release does not ship Vast: v0.33.0's provider import barrel includes existing GPU SSH providers such as nvidia-brev and runpod but no Vast import. (internal/providers/all/all.go:40, 966e99599db4)
  • Provider contract supports the chosen shape: The current provider-authoring docs say SSH-lease providers should populate an SSH target and core owns sync, command streaming, and related workflows. (docs/features/provider-authoring.md:35, 0ec69d642764)
  • PR registers the requested provider shape: The PR registers vast as a Linux ProviderKindSSHLease provider with ssh, crabbox-sync, cleanup, and CoordinatorNever. (internal/providers/vast/provider.go:23, 88f7d7f6c235)
  • Prior URL safety concern is addressed: The latest PR head rejects credentials, query strings, and fragments in vast.apiUrl, and the client requires HTTPS except loopback localhost. (internal/providers/vast/provider.go:87, 88f7d7f6c235)

Likely related people:

  • coygeek: Current-main history shows recent NVIDIA Brev, DigitalOcean, and provider-spec maintenance touching patterns closest to this Vast provider work. (role: recent direct-provider and provider-contract contributor; confidence: high; commits: 975d70260d7b, aff04c9f19a3, beb0dabaa97e; files: internal/providers/nvidiabrev/provider.go, internal/providers/digitalocean, internal/providers/runpod/provider.go)
  • Peter Steinberger: Recent history includes the provider backend registry, provider matrix/docs, RunPod REST SSH maintenance, and AWS Lambda MicroVM provider surfaces adjacent to this PR. (role: recent provider architecture and registry contributor; confidence: medium; commits: 494f3a4d779e, fdbf69be8b4b, 7323324381a2; files: internal/cli/provider_backend.go, internal/providers/all/all.go, docs/providers/README.md)
  • Yossi Eliaz: Git history shows the original RunPod provider commit, which is the closest existing GPU SSH-lease provider pattern. (role: introduced closest GPU SSH provider; confidence: medium; commits: cef985b92482; files: internal/providers/runpod, internal/providers/all/all.go)
  • Vincent Koc: Recent RunPod cleanup-safety work is relevant to reviewing another direct SSH-lease provider with destructive cleanup behavior. (role: recent adjacent provider lifecycle contributor; confidence: medium; commits: 35b728746e4e; files: internal/providers/runpod)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 other 🚨 Merging this PR has meaningful risk outside the owned taxonomy. labels Jun 25, 2026
coygeek added 2 commits June 24, 2026 21:21
Drop unused Vast helper functions left behind after the pagination and release-lifecycle fixes. CI runs deadcode with test reachability, and these helpers were no longer referenced by the provider implementation or tests.
@coygeek

coygeek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@clawsweeper clawsweeper Bot added the merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. label Jun 25, 2026
Reject query strings and fragments in the Vast API URL so configured provider endpoints cannot smuggle secret-bearing URL components or break path construction. Document the endpoint contract and cover query and fragment inputs in provider validation tests.
@coygeek

coygeek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 other 🚨 Merging this PR has meaningful risk outside the owned taxonomy. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. P2 Normal priority bug or improvement with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Vast.ai as a direct GPU SSH-lease provider

1 participant