Skip to content

feat(fal): add SSH lease provider#693

Open
coygeek wants to merge 13 commits into
openclaw:mainfrom
coygeek:fal-ssh-lease-provider
Open

feat(fal): add SSH lease provider#693
coygeek wants to merge 13 commits into
openclaw:mainfrom
coygeek:fal-ssh-lease-provider

Conversation

@coygeek

@coygeek coygeek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Closes #694

Summary

Adds a direct fal Compute SSH lease provider with Crabbox-managed SSH and sync support, local-claim-owned cleanup, and the fal-ai alias.

This also adds fal provider docs, provider matrix metadata, benchmark category generation, and an opt-in guarded live smoke script that defaults to no live provider mutation unless CRABBOX_LIVE=1 and fal is selected.

Lifecycle Safety

  • Uses the fal Compute Idempotency-Key header with the Crabbox lease ID for creates.
  • Retries ambiguous create responses with the same idempotency key and requires a provider instance ID before proceeding.
  • Avoids persisting unrecoverable empty-provider-ID recovery claims.
  • Rolls back callback/readiness/bootstrap failures unless --keep explicitly owns a failed-acquire recovery claim.
  • Cleanup removes provider-absent local claims before TTL filtering and only deletes local-claim-owned fal instances.

Verification

  • gofmt -w $(git ls-files '*.go') && git diff --check
  • go test ./internal/providers/fal
  • bash -n scripts/live-fal-smoke.sh
  • node scripts/generate-provider-matrix.mjs --check
  • scripts/check-docs.sh
  • node --test scripts/live-fal-smoke.test.js scripts/live-smoke.test.js
  • CRABBOX_LIVE= CRABBOX_LIVE_PROVIDERS= FAL_KEY= CRABBOX_FAL_KEY= scripts/live-fal-smoke.sh
  • go build -trimpath -o bin/crabbox ./cmd/crabbox
  • outside-source CLI proof for crabbox providers --json, doctor --provider fal --json, list --provider fal --json, and cleanup --provider fal --dry-run with blank fal credentials
  • go vet ./...
  • go run golang.org/x/tools/cmd/deadcode@v0.45.0 -test ./...
  • scripts/check-go-coverage.sh 90.0
  • go test -race ./...
  • autoreview --mode branch --base origin/main: clean, no accepted/actionable findings
  • GitHub Actions CI run https://github.com/openclaw/crabbox/actions/runs/28198853668 passed, including Go job https://github.com/openclaw/crabbox/actions/runs/28198853668/job/83532759285.

The local no-live smoke returned classification=environment_blocked reason=CRABBOX_LIVE_not_enabled, so no live fal resources were created during local verification.

coygeek added 11 commits June 25, 2026 10:36
Add fal provider registration, env-only credential loading, non-secret configuration, and a schema-backed Compute API client skeleton.

Keep fal lifecycle support out of the advertised run surface until the PLAN-02 backend wires acquire/list/release behavior.
Restore PLAN-01's advertised fal surface to an SSH lease provider with ssh, crabbox-sync, and cleanup capabilities.

Keep lifecycle behavior deferred behind explicit PLAN-02 errors so discovery matches the plan without silently performing unsupported resource operations.
Implement fal Compute lease acquire, resolve, list, touch, release, and cleanup flows with local-claim ownership checks.\n\nAdd offline lifecycle tests for rollback, recovery claims, status-only resolve, persisted SSH endpoints, and destructive-operation safeguards.
Document the direct fal Compute SSH lease provider, add provider matrix metadata, and regenerate the provider category surfaces.

Add an opt-in live smoke script with no-live defaults, credential gating, classified external blockers, redaction, cleanup attempts, and dispatcher coverage.
Build the acquired fal lease target after the SSH readiness probe so fallback port discovery is reflected in the returned lease as well as the persisted claim.

Add regression coverage for a configured SSH port corrected by the readiness probe.
Retry ambiguous fal Compute creates with the same lease idempotency key before proceeding so a recoverable provider id is required for local claim ownership.

Avoid persisting empty-provider-id recovery claims when idempotent reconciliation cannot return a fal instance id, and cover both retry success and retry failure paths.
Check claimed fal instances for provider absence before TTL eligibility so cleanup removes stale local claims for instances deleted outside Crabbox.

Add regression coverage for a non-expired local claim whose fal instance is already gone.
@clawsweeper

clawsweeper Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper review: did not complete due to Codex infrastructure failure. Reviewed June 25, 2026, 5:11 PM ET / 21:11 UTC.

Summary
Review failed before ClawSweeper could summarize the requested change.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Review metrics: none identified.

Merge readiness
Not assessed.
Failure reason: timeout.

This is a ClawSweeper/Codex infrastructure failure, not a PR readiness or patch-quality verdict.
Keep any merge decision on the normal maintainer review path until ClawSweeper can complete a fresh review.

Risk before merge

  • [P1] No close action taken because the review did not complete.

Maintainer options:

  1. Decide the mitigation before merge
    Retry the Codex review after fixing the execution failure.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • Review did not complete, so no work-lane recommendation was made.
Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

AGENTS.md: unclear because the file could not be read completely.

Codex review notes: model internal, reasoning high; reviewed against 184979d683b7.

Label changes

Label changes:

  • remove rating: 🧂 unranked krab: Current review failed before PR readiness was assessed, so no rating label should remain.
  • remove status: 📣 needs proof: Current PR status no longer selects a status label.
  • remove P2: Current review triage priority is none.
  • remove merge-risk: 🚨 other: Current PR review selected no merge-risk labels.
Evidence reviewed

What I checked:

  • failure reason: timeout.
  • codex failure detail: Codex review failed for this PR: Codex process timed out after 1199999ms.
  • codex stderr: No stderr captured.
  • codex stdout: {"type":"thread.started","thread_id":"019f008d-6c20-7721-9899-b774c8c3e1c3"}.
  • process error code: ETIMEDOUT.

Likely related people:

  • unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)
How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Drop the unused fal inventoryDoctorResult wrapper so the CI deadcode gate passes.
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 other 🚨 Merging this PR has meaningful risk outside the owned taxonomy. labels Jun 25, 2026
Give the Go CI job enough wall-clock budget to finish the full vet, deadcode, race, all-modules, coverage, and build gauntlet. The previous 15-minute cap canceled the coverage step after the preceding checks had already passed.
@coygeek

coygeek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Context on the Go CI timeout update in 605caed0:

The first red Go run was a real static-analysis issue, not runtime: deadcode reported internal/providers/fal/core.go:35:6: unreachable func: inventoryDoctorResult. That helper was an unused wrapper around the shared provider doctor result helper, so it was removed in 039ebdf3.

After that fix, the next Go run passed the earlier gates: formatting, go vet ./..., go run golang.org/x/tools/cmd/deadcode@v0.45.0 -test ./..., go test -race ./..., and scripts/test-go-modules.sh. It then entered scripts/check-go-coverage.sh 90.0 and was canceled by the Go job's 15-minute workflow timeout while coverage was still running. The log showed GitHub cancellation during Coverage, not a coverage assertion failure.

To verify the diagnosis before changing CI, I ran the same coverage command locally. It completed successfully with Go core coverage 91.3% >= 90.0%; the long-running package work, including internal/providers/xcpng, finished cleanly. Based on that, the timeout update was scoped to the Go job only: timeout-minutes: 15 -> 30, giving the existing full Go gauntlet enough wall-clock budget without weakening any check.

Result: the replacement GitHub Actions run passed: https://github.com/openclaw/crabbox/actions/runs/28198853668. The Go job passed in 14m17s, including Deadcode, Test, Test all Go modules, Coverage, and Build: https://github.com/openclaw/crabbox/actions/runs/28198853668/job/83532759285.

@clawsweeper clawsweeper Bot removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 other 🚨 Merging this PR has meaningful risk outside the owned taxonomy. P2 Normal priority bug or improvement with limited blast radius.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a fal.ai Compute provider

1 participant