Replace OpenRouter HTTP backend with opencode agent #4653

Merged
jurgenwerk merged 66 commits into main from cs-11034-software-factory-replace-openrouter-backend-with-opencode
May 12, 2026

@jurgenwerk

@jurgenwerk jurgenwerk commented May 5, 2026

Summary

The --agent openrouter path used to be a separate, hand-rolled HTTP loop with a fairly different toolset from the Claude path. This PR routes it through the same opencode engine the Claude side uses, so both backends now behave the same way and run the same tools.

What changes

  • Both agents share one toolset. Instead of "Claude has native filesystem tools but the OpenRouter agent has its own custom wrappers," everyone now gets the standard Read / Write / Edit / Bash / Glob / Grep plus our factory tools (validators, schema lookup, signal_done, etc.).
  • --openrouter-api-key flag added. Pass your OpenRouter key directly, or set OPENROUTER_API_KEY. With no key, the request is proxied through the realm server and billed to your boxel credits.
  • A pile of redundant "wrapper" tools is gone: read_file, write_file, fetch_transpiled_module, etc. They only existed because the old OpenRouter path didn't have native filesystem access. With opencode driving both backends, they're dead weight.
  • Skills simplified. No more "if Claude do X, if OpenRouter do Y" branching in the prompt skills — same instructions for both.

What stays the same

  • The Claude path still works exactly as before.
  • factory:go CLI surface is unchanged apart from the new optional flag.
  • All validation steps, signals, and the issue loop are untouched.

🤖 Generated with Claude Code

Foundation for replacing OpenRouterFactoryAgent with an opencode-driven
agent. Lays in the deps and a typed skeleton so the design is
reviewable in isolation; the full runtime — subprocess + relay server
+ MCP wrapper + signal capture — is deferred to a follow-up so this
commit does not break the working `--agent openrouter` HTTP path.

What's in:
  - `@opencode-ai/sdk` and `opencode-ai` (1.14.34) added as
    devDependencies. `opencode-ai` is a tiny stub with per-platform
    optionalDependencies (esbuild-style) so `pnpm install` resolves
    just the matching binary into `node_modules/.bin/opencode`. No
    manual `npm i -g` step.
  - `@modelcontextprotocol/sdk` added as a direct dep — the future
    MCP server wrapping `FactoryTool[]` will use it.
  - pnpm-workspace.yaml: opencode publishes ~hourly, so the
    `minimumReleaseAge: 1440` filter rejects every release. Added
    opencode + each platform variant to `minimumReleaseAgeExclude`.
  - root package.json: `opencode-ai` added to `onlyBuiltDependencies`
    so its postinstall (which symlinks the platform binary) is
    allowed to run.
  - `src/factory-agent/opencode.ts`: typed skeleton + design notes.
    Documents the target architecture (subprocess, dual auth, MCP for
    factory tools, `permission.external_directory: 'deny'` for path
    scoping). `run()` throws so a misconfigured wiring can't
    accidentally route here.

What's pending (CS-11034 follow-up):
  - Relay HTTP server for proxy auth mode.
  - In-process / subprocess MCP server wrapping FactoryTool[].
  - Event-stream consumption + DONE / CLARIFICATION signal capture.
  - Wiring change in factory-issue-loop-wiring.ts.
  - --openrouter-api-key CLI flag.
  - Deletion of OpenRouterFactoryAgent + the 5 OpenRouter-only tools
    + the CLAUDE_FILTERED_FACTORY_TOOLS filter, once opencode is
    verified end-to-end.

Lint + types clean. No behavior change for any existing run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 5, 2026

Host Test Results

    1 files  ±0      1 suites  ±0   1h 48m 20s ⏱️ + 2m 9s
2 654 tests ±0  2 639 ✅ ±0  15 💤 ±0  0 ❌ ±0 
2 673 runs  ±0  2 658 ✅ ±0  15 💤 ±0  0 ❌ ±0 

Results for commit 5b99722. ± Comparison against earlier commit e886cc4.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   11m 16s ⏱️ -20s
1 326 tests ±0  1 326 ✅ +1  0 💤 ±0  0 ❌  - 1 
1 405 runs  ±0  1 405 ✅ +1  0 💤 ±0  0 ❌  - 1 

Results for commit 5b99722. ± Comparison against earlier commit e886cc4.

Drops the direct-HTTP `OpenRouterFactoryAgent` for an opencode-backed
`OpencodeFactoryAgent` so `--agent openrouter` runs benefit from the
same native fs / Bash / Glob / Grep tools the Claude path already uses.
Both backends now go through native tools, so the five MCP wrappers
that existed purely to compensate for the prior fs-less OpenRouter
path can be retired.

What's in
=========

`OpencodeFactoryAgent` (`src/factory-agent/opencode.ts`):
  - Spawns `opencode` via `createOpencodeServer` from `@opencode-ai/sdk`
  - Two auth modes:
    1. `--openrouter-api-key <key>` (or env `OPENROUTER_API_KEY`) →
       opencode is configured with a direct OpenRouter provider via
       `@ai-sdk/openai-compatible`, key in the Authorization header.
    2. No key → spin up a tiny localhost relay HTTP server in-process
       that translates OpenAI-style requests into the realm-server
       `_request-forward` shape (`{ url, method, requestBody }`) and
       posts via JWT-authed `BoxelCLIClient.authedServerFetch`.
       Burns boxel tokens — same as the prior proxy mode.
  - In-process HTTP MCP server (`@modelcontextprotocol/sdk` Streamable
    HTTP transport) exposes the surviving 7 factory tools (5
    validators + `signal_done` + `request_clarification`) to the
    opencode subprocess.
  - Path scoping via opencode's built-in
    `permission.external_directory: 'deny'` + workspace `cwd` —
    replaces the `buildWorkspaceScopedCanUseTool` callback shape on
    the Claude side.
  - DONE / CLARIFICATION signals: tool symbols don't survive JSON-RPC,
    so the MCP server tags them `factory:done` / `factory:clarification`
    and the agent's signal-capture hook matches on the tag.
  - Lazy-imported via dynamic `import()` because the SDK is ESM-only
    and the test runner is CommonJS via ts-node.

CLI flag: `--openrouter-api-key <key>` plumbed through
`FactoryEntrypointOptions` → `IssueLoopConfig` →
`CreateLoopAgentConfig`. Falls back to env `OPENROUTER_API_KEY` when
absent, then to proxy mode when both are missing.
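
The fallback order above (flag, then environment, then proxy) can be sketched as a small pure function. `AuthMode` and `resolveAuthMode` are hypothetical names used for illustration, not identifiers from this PR, and treating a blank value as absent is an assumption.

```typescript
// Hypothetical sketch of the key-resolution order; not the PR's actual code.
type AuthMode =
  | { mode: "direct"; apiKey: string }
  | { mode: "proxy" };

function resolveAuthMode(
  flagValue: string | undefined,
  env: Record<string, string | undefined>,
): AuthMode {
  // 1. An explicit --openrouter-api-key value wins.
  // 2. Otherwise fall back to OPENROUTER_API_KEY from the environment.
  // (Assumption: blank strings count as "not provided".)
  const apiKey = flagValue?.trim() || env.OPENROUTER_API_KEY?.trim();
  // 3. With neither set, requests go through the realm-server proxy.
  return apiKey ? { mode: "direct", apiKey } : { mode: "proxy" };
}
```

Keeping the environment as a parameter rather than reading `process.env` inside the function makes the three branches easy to unit-test.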

Wiring: `--agent openrouter` now dispatches to `OpencodeFactoryAgent`.
Label is `openrouter (model=…, mode=direct|proxy)`. Requires
`workspaceDir` (errors if missing — opencode mounts it as `cwd`).

Deletions
=========

  - `src/factory-agent/openrouter.ts` — direct-HTTP class retired.
  - `read_file`, `write_file`, `search_realm`,
    `fetch_transpiled_module`, `run_command` builders in
    `factory-tool-builder.ts` — replaced by native fs / `boxel
    read-transpiled` / `boxel search` (via Bash) / `boxel run-command`
    (via Bash).
  - `CLAUDE_FILTERED_FACTORY_TOOLS` filter in `claude-code.ts` — no
    longer needed since neither backend wants the retired tools.
  - `tests/factory-agent-schema-boundary.test.ts` — the
    Zod-vs-JSON-Schema boundary it asserted no longer applies (the
    OpenRouter side now goes through MCP rather than raw HTTP).

Skill updates
=============

`.agents/skills/software-factory-bootstrap/SKILL.md` and
`.agents/skills/software-factory-operations/SKILL.md`: dropped the
`(Claude backend)` / `(OpenRouter backend)` dichotomy throughout.
Single description: native `Read` / `Write` / `Edit` / `Bash` for
workspace files, `Bash` + `boxel read-transpiled` / `boxel search`
for realm reads.

Tests
=====

  - `factory-tool-builder.test.ts`: dropped tests for retired tools;
    expanded the regression-list to assert all 5 OpenRouter-only +
    5 structured-update tools are absent.
  - `factory-agent-claude-code.test.ts`: rewrote
    "filters factory tools that have native or boxel CLI alternatives"
    to assert the surviving filter (registry-sourced shadow tools
    still excluded; native-fs-replaced tools no longer asserted
    because they don't exist any more).
  - 137 targeted unit tests pass (factory-tool-builder,
    factory-agent-claude-code, factory-prompt-loader,
    factory-context-builder, issue-loop). Lint + types clean.

Honest caveats — needs your verification
========================================

I cannot end-to-end verify the opencode subprocess or the relay
server without an OpenRouter API key + a live realm server. The
shapes type-check and the unit tests pass, but the first run with
`pnpm factory:go --agent openrouter ...` is the real test. The
likely failure modes I can think of:

  - opencode SDK's `session.prompt` may expect a slightly different
    `model` shape than I used.
  - The relay server's response Content-Type passthrough may not be
    exactly what the AI SDK expects (it expects JSON for
    chat/completions; I forward whatever the proxy returns).
  - The MCP HTTP transport may require a specific path or session
    handshake I haven't accounted for.

Each of those is a quick fix once observed empirically. Run with
`--debug` and share output if anything misbehaves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jurgenwerk jurgenwerk changed the title CS-11034: replace OpenRouter backend with opencode (WIP — foundation only) CS-11034: replace OpenRouter backend with opencode May 5, 2026
@jurgenwerk jurgenwerk changed the base branch from main to retire-structured-update-tools May 5, 2026 12:32
@jurgenwerk jurgenwerk changed the title CS-11034: replace OpenRouter backend with opencode CS-11034: replace OpenRouter HTTP backend with opencode (native fs for both backends) May 5, 2026
jurgenwerk and others added 24 commits May 6, 2026 11:55
…factory-replace-openrouter-backend-with-opencode

# Conflicts:
#	packages/software-factory/.agents/skills/software-factory-bootstrap/SKILL.md
#	packages/software-factory/.agents/skills/software-factory-operations/SKILL.md
#	packages/software-factory/src/factory-tool-builder.ts
#	packages/software-factory/tests/factory-tool-builder.test.ts
…ncode relay

Replaces the in-process HTTP relay that software-factory's opencode
agent spun up in passthrough mode with a dedicated realm-server
endpoint that accepts a verbatim OpenAI chat-completions body. Same
JWT + credit-strategy + streaming pipeline as `_request-forward`,
just without the OpenAI-→-`{url, method, requestBody}` re-shape.

- Extracted the shared `pendingCostPromises` barrier, cost-deduction
  scheduler, and SSE streaming handler into `lib/proxy-forward.ts`;
  both `_request-forward` and the new endpoint use it so per-user
  cost ordering stays consistent.
- New `POST /_openrouter/chat/completions` handler pins the upstream
  to OPENROUTER_CHAT_URL server-side, looks up the destination config
  from the existing `proxy_endpoints` whitelist, and forwards
  verbatim. Streaming is driven by `stream: true` in the body.
- opencode now points its OpenAI-compatible provider's `baseURL` at
  `<realmServerUrl>/_openrouter` and stamps the realm-server JWT
  (fetched once via the new `BoxelCLIClient.getServerToken`) into the
  static Authorization header. The 7-day JWT TTL means a single
  ticket run is in no danger of outlasting it.
- Removes `startProxyRelayServer`, `buildRelayProviderConfig`, and
  the unused `OPENROUTER_CHAT_URL` constant from software-factory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`client.session.prompt` is documented as blocking until the model +
tool loop completes, but in opencode SDK 1.14.34 the HTTP response
isn't reliably flushed once the loop exits — the server-side session
goes idle (`session.idle` is published, snapshot cleanup runs) but
our await keeps hanging indefinitely, never reaching the teardown in
`finally`.

Subscribe to the per-directory event bus before creating the session,
fire the prompt without awaiting it directly, and drive completion
off the first `session.idle` event matching our sessionId. Also break
on `session.error` so an upstream auth/length/abort failure doesn't
keep us stuck. The prompt's return value was unused (DONE /
CLARIFICATION signals come back through the MCP server), so dropping
the await on it costs nothing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous attempt to drive completion off `client.event.subscribe`
also turned out unreliable in opencode 1.14.34: the SSE stream was
established mid-session and silently missed the eventual
`session.idle` event published when the loop exited (the realm-server
log clearly showed multiple successful 200 responses to
`/_openrouter/chat/completions` followed by opencode emitting
`session.idle publishing` — but our parent never saw the event and
hung indefinitely).

Switch to polling `client.session.status` every 750ms instead.
SessionStatus is a discriminated `idle | retry | busy` union, so the
only edge to handle is the post-create-but-pre-prompt window where
status is still `idle`. The polling helper waits for the first
non-idle observation before treating a subsequent `idle` as "the loop
finished", with a 30-minute upper bound so a hung session can't trap
the factory loop forever.
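
A minimal sketch of the "ignore the initial idle, then treat the next idle as done" rule described above. `IdleDetector` and `hasTimedOut` are hypothetical names, and the local `SessionStatus` type is a simplified stand-in for the SDK's union (the real variants may carry extra fields). The 750ms and 30-minute values come from this commit message.

```typescript
// Simplified stand-in for the SDK's status union (assumption).
type SessionStatus = { type: "idle" } | { type: "retry" } | { type: "busy" };

const POLL_INTERVAL_MS = 750;
const MAX_WAIT_MS = 30 * 60 * 1000; // upper bound so a hung session cannot trap the loop

class IdleDetector {
  private seenActive = false;

  /** Feed one polled status; returns true once the loop is considered finished. */
  observe(status: SessionStatus): boolean {
    if (status.type !== "idle") {
      // First non-idle reading: the prompt has actually started running.
      this.seenActive = true;
      return false;
    }
    // An idle reading only counts after we have seen the session active.
    return this.seenActive;
  }
}

function hasTimedOut(startedAtMs: number, nowMs: number): boolean {
  return nowMs - startedAtMs >= MAX_WAIT_MS;
}
```

A caller would invoke `observe` once per `POLL_INTERVAL_MS` tick and stop when it returns `true` or when `hasTimedOut` fires.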

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…in opencode poll

The previous polling pass caught the right-shaped HTTP loop running
but the response body was always `{}`, so we never saw the session
transition. Two issues, both apparent only against a live opencode
1.14.34:

1. `/session/status` requires the same `directory` query that
   `session.create` was called with. Without it the response is
   unconditionally empty regardless of session state.
2. Empirically the endpoint returns *currently busy* sessions only —
   when a session goes idle, its entry disappears from the map
   instead of staying with `type: 'idle'`.

Pass the workspace directory through to the status call, and treat
"session disappeared after first being seen busy" as completion in
addition to an explicit `type: 'idle'` reading.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
opencode internally normalizes the `directory` query through its own
realpath before storing the session. On macOS this rewrites
`/var/folders/...` (the path Node hands us via tmpdir) to
`/private/var/folders/...` — they're the same directory but distinct
strings, and opencode's `/session/status?directory=...` filter is a
straight string match. Result: the status endpoint returned `{}` for
every poll because we were asking about `/var/...` while opencode had
filed the session under `/private/var/...`.

Pre-resolve `workspaceDir` once with `realpathSync` and use the
canonical form for both `session.create` and `session.status`. The
"session disappears from the status map after observed busy" branch
added in the previous commit is still useful as a belt-and-suspenders
completion signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ability

Both prior completion signals turned out unusable in opencode 1.14.34:
the `session.prompt` HTTP response hangs after the loop exits, the
`/session/status` map returns `{}` regardless of session state (live
probing on a busy session confirmed this even with the canonical
directory query), and `client.event.subscribe` was unreliable in
earlier attempts.

The only signal that's both present and reliable in this version is
`session.list[id].time.updated`: opencode bumps it on every
`message.part.delta` and step transition, so we can watch it through
a 5-second stability window. When `time.updated` hasn't moved for
5s, the model + tool loop is idle.
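
The stability-window idea can be sketched as a small tracker that is fed the latest `time.updated` value on each poll. `StabilityWindow` is a hypothetical name, and timestamps are passed in explicitly (an assumption made here so the logic can be tested without real timers).

```typescript
// Hypothetical sketch: report "idle" once the observed value stops changing
// for at least windowMs.
class StabilityWindow {
  private readonly windowMs: number;
  private lastValue: number | undefined = undefined;
  private lastChangeAt: number;

  constructor(windowMs: number, nowMs: number) {
    this.windowMs = windowMs;
    this.lastChangeAt = nowMs;
  }

  /** Feed the latest time.updated reading; returns true when it has been stable long enough. */
  observe(updated: number, nowMs: number): boolean {
    if (updated !== this.lastValue) {
      // The session made progress: restart the window.
      this.lastValue = updated;
      this.lastChangeAt = nowMs;
      return false;
    }
    return nowMs - this.lastChangeAt >= this.windowMs;
  }
}
```

The window length is the only tuning knob: too short and a slow step looks like completion, too long and every run that ends without a signal pays the full wait.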

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit added a 5s stability window on `time.updated` polling
to detect opencode loop completion. That penalized the happy path
where the model calls `signal_done` (or `request_clarification`) —
the captured signal was already available, but we kept polling for
5s anyway.

Race the captured-signal promise against `waitForSessionIdle`. The
MCP server resolves a one-shot promise the moment it sees a
`factory:done` / `factory:clarification` tag come back, so the normal
flow returns with zero added latency. The polling stays as a
fallback for the (rare) case where the model exits the loop without
emitting either signal — and even there the stability window drops
from 5s to 2s now that polling is only the fallback.
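
The race described above can be sketched with a one-shot promise and `Promise.race`. `createSignalCapture`, `waitForCompletion`, and the `Signal` / `Outcome` types are hypothetical names for illustration; `waitForIdle` stands in for the polling fallback.

```typescript
type Signal = { kind: "done" } | { kind: "clarification"; question: string };
type Outcome = { via: "signal"; signal: Signal } | { via: "idle" };

// One-shot promise: the MCP side calls capture() when it sees a tagged signal.
function createSignalCapture(): { promise: Promise<Signal>; capture: (s: Signal) => void } {
  let resolve!: (s: Signal) => void;
  const promise = new Promise<Signal>((r) => {
    resolve = r;
  });
  // Resolving an already-resolved promise is a no-op, so repeat captures are harmless.
  return { promise, capture: (s) => resolve(s) };
}

// Whichever settles first wins: the captured signal (happy path) or the
// idle-polling fallback (model exited without signalling).
function waitForCompletion(
  signal: Promise<Signal>,
  waitForIdle: () => Promise<void>,
): Promise<Outcome> {
  return Promise.race([
    signal.then((s): Outcome => ({ via: "signal", signal: s })),
    waitForIdle().then((): Outcome => ({ via: "idle" })),
  ]);
}
```

One thing this sketch leaves out: the losing branch keeps running after the race settles, so a real polling loop would also need a way to be cancelled.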

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tions

Two issues observed in a real run:

1. The 2s `time.updated` stability window falsely reported the
   session as idle while the model was actively streaming. Empirically
   `time.updated` only ticks at step boundaries (not on every
   `message.part.delta`), and opus can sit 30+ seconds between steps.
   Bump the window to 60s. The polling is only the fallback for when
   the model exits without `signal_done` / `request_clarification` —
   the captured-signal race short-circuits this on the happy path, so
   the wider window doesn't add latency to normal runs.

2. `opencode.close()` returns synchronously but doesn't wait for the
   spawned subprocess to actually exit. The next iteration's
   `createOpencodeServer` then hits EADDRINUSE on opencode's fixed
   port 4096 and the whole factory:go run dies. Add a
   `waitForPortFree(4096, 5000ms)` poll after `opencode.close()` so
   the next iteration can bind cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`opencode.close()` from the SDK only sends SIGTERM via `proc.kill()`,
and the precompiled opencode 1.14.34 binary apparently ignores it
— so the spawned subprocess keeps running and continues holding the
fixed port 4096 long after we ask it to close. Iteration 2 of
factory:go (and every iteration after) then hits EADDRINUSE because
the SDK has no force-kill path we can call into.

Replace the blind "wait for port to free" loop with a wait-then-
escalate strategy: 1s graceful window for SIGTERM to land, then look
up the listening PID via `lsof` and `process.kill(pid, 'SIGKILL')`,
then a short post-kill wait so the kernel releases the port before
the next iteration spawns its own opencode. The lsof path is
best-effort — if it can't run we fall through and let the next
iteration surface a clearer EADDRINUSE error.
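
A rough sketch of the wait-then-escalate flow, with the OS-specific pieces injected so the control flow is visible on its own. `TeardownDeps` and `waitThenEscalate` are hypothetical names; the 100ms poll step and 250ms post-kill wait are assumptions (the message above only says "short"), and the `lsof` lookup is represented by `findListeningPid` rather than implemented.

```typescript
interface TeardownDeps {
  isPortFree(port: number): Promise<boolean>;
  findListeningPid(port: number): Promise<number | undefined>; // e.g. backed by lsof
  kill(pid: number, signal: "SIGKILL"): void;
  sleep(ms: number): Promise<void>;
}

async function waitThenEscalate(
  port: number,
  deps: TeardownDeps,
  graceMs = 1000,
  postKillMs = 250, // assumption: value not stated in the PR
): Promise<"graceful" | "killed" | "unknown"> {
  const stepMs = 100; // assumption: poll granularity
  // Phase 1: give SIGTERM a chance to land.
  for (let waited = 0; waited < graceMs; waited += stepMs) {
    if (await deps.isPortFree(port)) return "graceful";
    await deps.sleep(stepMs);
  }
  // Phase 2: best-effort PID lookup; any failure falls through.
  let pid: number | undefined;
  try {
    pid = await deps.findListeningPid(port);
  } catch {
    pid = undefined;
  }
  if (pid === undefined) return "unknown";
  // Phase 3: force-kill, then let the kernel release the port.
  deps.kill(pid, "SIGKILL");
  await deps.sleep(postKillMs);
  return "killed";
}
```

Returning `"unknown"` instead of throwing mirrors the best-effort stance: the next spawn is left to surface EADDRINUSE if the port is still held.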

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t POST

Two issues blocking real opencode runs:

1. Opus was treating "tools for reading and writing the workspace
   mirror" as descriptive language and never invoked the actual fs
   tools — iterations spent minutes generating reasoning text and
   produced zero files. Replace the vague paragraph with an explicit
   tool inventory (`Write`, `Read`, `Edit`, `Glob`, `Grep`, `Bash`,
   plus the factory-specific tools) and an explicit instruction to
   call tools rather than describe what would be written.

2. opencode 1.14.34 occasionally rejects the very first
   `/session/{id}/message` POST on a freshly-spawned subprocess with
   `TypeError: fetch failed`, which kills every iteration after the
   first. Wrap the call in a 500ms retry so the flake is hidden.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iter 1 succeeded (`Agent returned 1 tool call(s)`, validation passed,
SIGKILL-on-teardown clean). Iter 2's opencode subprocess came up but
both `session.prompt` attempts immediately failed with `TypeError:
fetch failed`, indicating the subprocess died right after the
"server listening" line. The polling helper's `client.session.list`
call then also threw `fetch failed`, propagated past the agent.run()
boundary, and crashed the entire factory:go process — losing all
progress from iter 1.

Wrap `session.list` in try/catch inside `waitForSessionIdle`, count
consecutive failures, and return cleanly after 5 in a row (~3.75s).
That lets the agent surface "0 tool calls" for the failed iteration
while the outer issue loop keeps going to iter 3.
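
The consecutive-failure rule can be sketched as a tiny counter that the polling loop consults after each `session.list` attempt. `ConsecutiveFailureGuard` is a hypothetical name; the limit of 5 comes from this commit message.

```typescript
// Hypothetical sketch: give up only after `limit` failures in a row;
// any success resets the streak.
class ConsecutiveFailureGuard {
  private readonly limit: number;
  private failures = 0;

  constructor(limit: number) {
    this.limit = limit;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  /** Returns true when the caller should stop polling and return cleanly. */
  recordFailure(): boolean {
    this.failures += 1;
    return this.failures >= this.limit;
  }
}
```

Inside the loop, the `catch` around the list call would call `recordFailure()` and return when it yields `true`, while a successful call would invoke `recordSuccess()` before inspecting the result.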

Doesn't fix the underlying opencode flakiness — that's a separate
chase. But it stops a single bad iteration from killing the whole
run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spawning a fresh opencode subprocess per agent.run() was the actual
root cause of the iter-2+ `TypeError: fetch failed` cascade. opencode
1.14.34 is shaped to be a long-lived server with many short-lived
sessions; rapid restarts (close → SIGTERM → SIGKILL → respawn → first
prompt) hit failure modes around the SQLite-backed session store and
the `/session/{id}/message` handler.

Refactor:

- `OpencodeFactoryAgent` now holds the MCP server, opencode
  subprocess, and SDK client as instance state. `ensureStarted()`
  spawns them lazily on first `run()` and is idempotent on
  subsequent calls. `run()` only creates a new session, fires the
  prompt, waits for completion, and clears its per-run hooks — no
  teardown.
- The MCP server's `onToolCall` / `onSignal` callbacks now forward
  into a swappable `currentHooks` pointer that `run()` swaps in / out
  around each session, so a single long-lived MCP server can serve
  many sequential runs.
- Add optional `close(): Promise<void>` to the `LoopAgent` interface.
  `factory-issue-loop-wiring` calls it in a `finally` after
  `runIssueLoop` returns (or throws), so the opencode subprocess and
  MCP server are torn down exactly once per factory:go run instead
  of N times.
- Drop the per-run `session.prompt` retry — the flake it papered
  over was almost entirely caused by the rapid-restart pattern this
  refactor eliminates.
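
The lifecycle described in this refactor (start once, swap hooks per run, tear down once) can be sketched as follows. `LongLivedAgent`, `RunHooks`, and `dispatchToolCall` are hypothetical names; the real class holds an MCP server, a subprocess, and an SDK client, which are reduced here to injected `start` / `stop` callbacks.

```typescript
interface RunHooks {
  onToolCall(name: string): void;
}

class LongLivedAgent {
  private readonly start: () => Promise<void>;
  private readonly stop: () => Promise<void>;
  private started: Promise<void> | undefined = undefined;
  private currentHooks: RunHooks | undefined = undefined;

  constructor(start: () => Promise<void>, stop: () => Promise<void>) {
    this.start = start;
    this.stop = stop;
  }

  // Idempotent: the first call kicks off startup, later calls reuse the same promise.
  private ensureStarted(): Promise<void> {
    if (!this.started) this.started = this.start();
    return this.started;
  }

  // Called by the long-lived server; forwards to whichever run is active.
  dispatchToolCall(name: string): void {
    this.currentHooks?.onToolCall(name);
  }

  async run(hooks: RunHooks, body: () => Promise<void>): Promise<void> {
    await this.ensureStarted();
    this.currentHooks = hooks;
    try {
      await body();
    } finally {
      this.currentHooks = undefined; // clear per-run hooks, no teardown here
    }
  }

  // Called once by the outer loop, e.g. from a finally block.
  async close(): Promise<void> {
    if (!this.started) return;
    this.started = undefined;
    await this.stop();
  }
}
```

Caching the startup promise rather than a boolean means two overlapping `run()` calls would still share a single startup.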

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…delta

`proxySSE` accumulated bytes into `buffer` and then called a helper
`extractSSELines(buffer)` to split off complete lines. The helper
reassigned its *local* `buffer` parameter as it consumed lines, so
the caller's `buffer` in `proxySSE` never got trimmed. On every new
network read we re-extracted and re-dispatched every previously
seen line.

For the AI Bot path through `_request-forward` this was latent but
not catastrophic. For opencode driving the new
`/_openrouter/chat/completions` passthrough it was fatal: the model
saw each text delta repeated N times and concatenated them, so
assistant text came out as `"II'll processI'll process this b..."`
and tool-call argument JSON became `{"command{"command{"command":
"ls...`. Every tool invocation rejected with `Invalid input ... JSON
parsing failed`, so the model never managed to call Write or any
other native tool — explaining the full day of "model thinks for
minutes but produces zero files" runs.

Inline the line-splitting in `proxySSE` so the trailing incomplete
fragment is kept in `buffer` and complete lines are dispatched
exactly once.
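
A minimal sketch of the corrected splitting behaviour: complete lines are returned exactly once and the trailing fragment stays buffered. `SSELineBuffer` is a hypothetical name, and the sketch works on already-decoded strings (an assumption; the real `proxySSE` accumulates bytes from the network).

```typescript
// Hypothetical sketch of the fix: the buffer is owned by the object that
// consumes it, so trimming can never be lost to a reassigned local parameter.
class SSELineBuffer {
  private buffer = "";

  /** Append a chunk and return every line completed by it, each exactly once. */
  push(chunk: string): string[] {
    this.buffer += chunk;
    const parts = this.buffer.split("\n");
    // The last element is "" if the chunk ended on a newline, otherwise an
    // incomplete line that must wait for the next chunk.
    this.buffer = parts.pop() ?? "";
    // Tolerate CRLF line endings.
    return parts.map((line) => (line.endsWith("\r") ? line.slice(0, -1) : line));
  }
}
```

The original bug is the classic pass-by-value trap: a helper that reassigns its own string parameter cannot shrink the caller's variable, so returning the remainder (or owning the state, as here) is required.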

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Factory:go ran silent for minutes between "Inner iteration N/8" and
the next visible log line, even when the realm-server log showed
the model actively making chat completions. Adds three sources of
visible progress so users don't fly blind:

- Log every factory MCP tool call as it lands (run_lint, run_tests,
  signal_done, ...) with a short arg summary.
- Subscribe to opencode's per-directory event stream for *logging
  only* — surfaces native opencode tool invocations (Read / Write /
  Bash / Edit) and the eventual `session.idle` / `session.error`
  events. Best-effort: any SSE failure is swallowed; completion
  detection still uses `time.updated` polling, this stream isn't on
  the critical path.
- Heartbeat log every 15s from the polling loop showing elapsed
  time and whether the session is still actively updating or idle
  pending the stability window.

Also log the session id when a new session is created so users can
grep the opencode log file by that id if they need deeper detail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
opencode bundles a default tool kit (read, write, edit, bash, glob,
grep, plus webfetch, task, todowrite, skill, question, invalid).
Every tool definition is included in every chat completion, so the
extras cost tokens on every model call and add up dramatically over
a multi-step session. They also give the model more ways to "stall"
on unhelpful actions (writing TODOs about what to do instead of
doing it, dispatching subtasks, etc.).

Pass an explicit `tools` map to `session.prompt` enabling only the
six we actually use (`read` / `write` / `edit` / `bash` / `glob` /
`grep`) and disabling the rest. Factory MCP tools are unaffected —
they ride the MCP transport, not this whitelist.
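
For illustration, the whitelist could be built like this. The tool names are the ones listed above; `buildToolsMap` is a hypothetical helper, and the `name -> boolean` shape of the map is an assumption about what `session.prompt` accepts.

```typescript
// Native tools the factory actually uses.
const ENABLED_TOOLS = ["read", "write", "edit", "bash", "glob", "grep"] as const;
// Bundled extras that are switched off to save tokens and avoid stalling.
const DISABLED_TOOLS = ["webfetch", "task", "todowrite", "skill", "question", "invalid"] as const;

// Hypothetical helper: an explicit map, so tools are disabled by name
// instead of relying on whatever the defaults happen to be.
function buildToolsMap(): Record<string, boolean> {
  const tools: Record<string, boolean> = {};
  for (const name of ENABLED_TOOLS) tools[name] = true;
  for (const name of DISABLED_TOOLS) tools[name] = false;
  return tools;
}
```

A tool added to the bundle in a later opencode release would not appear in either list, so the lists would need revisiting on upgrade.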

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hat we send

PR #4652 (`retire-structured-update-tools`) updated `system.md` and the
two SKILL files to drop references to removed wrappers, but left the
four ticket prompt templates and the seed-issue description still
telling the model to call `write_file`, `read_file`, `search_realm`,
`update_issue`, `create_knowledge`. The opencode model would then try
to invoke tools that don't exist, fall back to `Edit` or `Bash`, and
in test-18 just gave up and called `signal_done` after creating zero
files.

- `prompts/bootstrap-implement.md` / `ticket-implement.md` /
  `ticket-iterate.md` / `ticket-test.md`: replace dead tool refs with
  the real native opencode tools (`Write`, `Read`, `Edit`, `Glob`,
  `Bash`) and the surviving factory MCP tools (`signal_done`,
  validators). Add an explicit "calling `signal_done` without writing
  the required files is a failure" line so opus stops bailing early.

- `src/factory-seed.ts`: include the full `brief.content` in the seed
  issue description (was only embedding `brief.contentSummary` — a
  one-line blurb, way too thin to drive a bootstrap from). Replace the
  stale "mark this issue done via `update_issue`" footer with explicit
  Write + signal_done instructions.

- `src/factory-agent/opencode.ts`: when `--debug`, log the full system
  prompt, user prompt, enabled native opencode tools, and enabled
  factory MCP tools right before sending each `session.prompt` so we
  can see exactly what the model gets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gent context

`mapCardToSchedulableIssue` was extracting only `status`, `priority`,
`blockedBy`, `order`, `summary`, `issueType` from the realm card —
silently dropping `description` and `acceptanceCriteria`. The
scheduler-issued objects then flowed all the way into the agent
prompt as `{{issue.description}}`, which rendered to an empty string.

That meant every iteration the model got "## Current Issue\n\nID:
...\nSummary: Process brief and create project artifacts\n\nDescription:\n\n##
What to Create\n..." — i.e. the brief content the seed creator went
to the trouble of embedding in the issue description was being thrown
away before the model ever saw it. The model bootstrap-completed with
zero artifacts because it had no idea what the brief was about.

Pass `description` and `acceptanceCriteria` through the mapper so they
reach the agent.
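
A simplified sketch of the fix, showing why a field-by-field mapper silently loses anything it does not name. The types and `mapCardSketch` are hypothetical stand-ins (field types are assumptions), not the real `mapCardToSchedulableIssue` signature.

```typescript
// Hypothetical, simplified shapes; the real card and issue types differ.
interface IssueCardAttributes {
  status: string;
  priority: string;
  blockedBy: string[];
  order: number;
  summary: string;
  issueType: string;
  description?: string;
  acceptanceCriteria?: string;
}

interface SchedulableIssueSketch extends IssueCardAttributes {
  id: string;
}

function mapCardSketch(id: string, attrs: IssueCardAttributes): SchedulableIssueSketch {
  return {
    id,
    status: attrs.status,
    priority: attrs.priority,
    blockedBy: attrs.blockedBy,
    order: attrs.order,
    summary: attrs.summary,
    issueType: attrs.issueType,
    // The two fields that were previously dropped; without these lines a
    // template placeholder such as {{issue.description}} renders empty.
    description: attrs.description,
    acceptanceCriteria: attrs.acceptanceCriteria,
  };
}
```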

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The opencode SDK's `createOpencodeServer` spawns the binary with no
`cwd` option set on the child process — the subprocess inherits the
parent's cwd. The model's native fs tools (`Read` / `Write` / `Edit`)
then resolve relative paths against THAT inherited cwd, not the
workspace.

Result: when the model called `Read("Projects/sticky-note.json")` the
permission log showed it trying
`/Users/jurgen/development/boxel/packages/software-factory/Projects/sticky-note.json`
(the directory `pnpm factory:go` was invoked from) rather than
`/private/var/folders/.../boxel-factory-workspaces/<realm>/Projects/sticky-note.json`
(the actual workspace). Reads always failed (those files don't exist
under the source tree), the model never managed to inspect existing
state cleanly, and never went on to call `Write` for the artifacts.

Pre-resolve the workspace's canonical realpath once and `process.chdir`
into it across the `createOpencodeServer` call, so the subprocess
forks with the right cwd. Restore the parent's cwd in `finally` —
once the child has forked, the parent's cwd doesn't matter.
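
The chdir-around-spawn pattern can be sketched as a small wrapper. `withCwd` is a hypothetical helper, and `fn` stands in for the `createOpencodeServer` call; only Node's `realpathSync`, `process.cwd`, and `process.chdir` are used.

```typescript
import { realpathSync } from "node:fs";

// Hypothetical helper: run `fn` with the process cwd set to the canonical
// form of `dir`, so a child process spawned inside `fn` inherits it.
async function withCwd<T>(dir: string, fn: () => Promise<T>): Promise<T> {
  const previous = process.cwd();
  // Canonicalize once (e.g. /var/... vs /private/var/... on macOS).
  const canonical = realpathSync(dir);
  process.chdir(canonical);
  try {
    return await fn();
  } finally {
    // Restore even if the spawn throws.
    process.chdir(previous);
  }
}
```

Note that `process.chdir` is process-wide, so anything else running in the parent while `fn` is pending sees the temporary cwd too; keeping the wrapped call as short as possible limits that window.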

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation

The test-21 run finally produced cards but two of them showed
"Card Error: Expected array for field value tags" in Boxel because
the model wrote `tags: "a, b, c"` (a comma-separated string) where
the schema declared `tags` as a `containsMany StringField` (a real
array). Same root cause for any other guess-the-shape failure: the
model never called `get_card_schema`, despite the prompt saying it
should, because the instruction was buried in the skill file.

- `prompts/bootstrap-implement.md`: promote schema fetching to a
  mandatory **Step 0** at the top of the Instructions block, with the
  three required calls written out verbatim and an explicit warning
  that `containsMany` fields must be JSON arrays in `attributes`. The
  rest of the steps come after, framed as "now create the artifacts in
  this order so relationship targets exist when referenced."

- `.agents/skills/boxel-file-structure/SKILL.md`: add a "containsMany
  Attributes" section right next to the existing "linksToMany
  Relationships" section, with a `["a", "b", "c"]` example and the
  exact error string the wrong shape produces. Add a matching row to
  the Common Mistakes table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…:go is slow

Captures the architecture-level differences between the previous direct
OpenAI tool-use loop and the current opencode SDK runtime, four
hypotheses for the observed slowdown (model emitting fewer tool_calls
per step, per-step prompt overhead, opencode bookkeeping, model
thinking time), and what we'd need to instrument to verify them.

Note: token counts and step counts in the doc are explicitly flagged as
estimates, not measurements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds opt-in observability for the realm-server's openrouter
passthrough so we can measure the per-step prompt overhead and
tool-call distribution that drives the factory's wall-clock cost.

When FACTORY_INSTRUMENT_PATH is set, every chat-completion request
through /_openrouter/chat/completions writes a JSONL record with:

- request: model, system_chars, tools_count + tools_chars,
  messages_count + messages_chars, total input chars, rough token
  estimates, parallel_tool_calls value, tool_choice
- response: tool_calls count, tool call names, assistant text size,
  finish_reason, provider usage tokens, TTFB, duration

`pnpm factory:stats <jsonl>` (in software-factory) summarises the
log: model identity, distribution of tool_calls per assistant
response, per-step prompt overhead, ground-truth usage tokens, and
wall-clock per request. Designed to answer the four hypotheses in
OPENCODE_PERFORMANCE.md (H1 tool-call batching, H2 prompt overhead,
H3 wall-clock, H4 model identity).

Off by default; no behaviour change unless FACTORY_INSTRUMENT_PATH is
set. The streaming hook is plumbed through handleStreamingRequest as
an optional StreamingInstrumentation parameter so the existing
_request-forward caller is unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…prompt dump

Two unrelated observability fixes for the opencode-backed agent.

1. fetch-error unwrapping. opencode 1.14.34 surfaces every network
   failure as undici's TypeError("fetch failed"), with the real cause
   (ECONNREFUSED, UND_ERR_HEADERS_TIMEOUT, AbortError, etc.) buried in
   `err.cause`. The previous String(err) threw all of that away, so a
   `session.prompt rejected: TypeError: fetch failed` warning was
   indistinguishable between "subprocess crashed", "upstream timed
   out", and "we cancelled the request". Adds describeFetchError()
   that walks up to four levels of cause chain and reports the codes,
   plus probeOpencode() that hits the subprocess's /app endpoint with
   a short timeout to report alive/dead at the moment of failure.
   Both session.prompt and session.list catch sites now use them. A
   startup info line points at the live opencode log directory so the
   operator knows where to tail when warnings fire.

2. drop --debug prompt dump. The `--- system prompt (N chars) ---`
   block printed the entire merged system prompt (~10K+ tokens worth
   of skills) on every iteration. With concurrent loggers writing to
   stdout, the multi-line message raced with other writes and produced
   garbled output in which the `factory-agent-opencode` prefix was cut
   off mid-word. Removed; the existing `Agent backend: ...` line is
   enough.
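
The cause-chain walk from item 1 can be sketched as follows. This is not the PR's `describeFetchError()`; the output format, the `ErrorLike` type, and the reading of "four levels" as the top-level error plus up to four causes are assumptions.

```typescript
// Loose structural view of an error; real errors may carry more fields.
interface ErrorLike {
  name?: string;
  message?: string;
  code?: string;
  cause?: unknown;
}

// Assumption: report the top-level error plus at most four nested causes.
const MAX_CAUSE_DEPTH = 4;

function describeFetchError(err: unknown): string {
  const parts: string[] = [];
  let current: unknown = err;
  // The depth bound also protects against a cause chain that loops.
  for (let depth = 0; depth <= MAX_CAUSE_DEPTH && current != null; depth++) {
    if (typeof current !== 'object') {
      // A cause can be any value, for example a plain string.
      parts.push(String(current));
      break;
    }
    const e = current as ErrorLike;
    // Prefer the machine-readable code (e.g. ECONNREFUSED) over the name.
    const label = e.code ?? e.name ?? 'Error';
    parts.push(e.message ? `${label}: ${e.message}` : label);
    current = e.cause;
  }
  return parts.join(' <- ');
}

// Usage: simulate a generic "fetch failed" error wrapping a coded cause.
const failure = Object.assign(new TypeError('fetch failed'), {
  cause: { code: 'ECONNREFUSED', message: 'connect ECONNREFUSED' },
});
console.log(describeFetchError(failure));
```

Printing the chain on one line keeps the warning greppable while still distinguishing a refused connection from a timeout or an abort.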

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both tools route through the realm-server's prerender sandbox, which
loads modules from the realm filesystem. Before this change, an agent
that wrote a .gts via Write and then immediately called
run_evaluate({ path }) hit a 404 because the realm hadn't seen the
file yet — the orchestrator only synced between iterations.

Fix: ToolBuilderConfig gains an optional syncWorkspace callback. The
issue-loop wiring passes the same syncWorkspaceToRealm function the
orchestrator uses for post-signal_done validation. run_evaluate and
run_instantiate now call syncWorkspace() first; on failure they
return a typed error result without attempting the realm call.

Cost: ~500ms-2s on first call after writes, near-zero on subsequent
calls since boxel-cli's sync is mtime-aware. The orchestrator's
post-signal_done sync is now a no-op when nothing changed.

run_parse, run_lint, and run_tests already read directly from the
workspace, so they're unaffected.

The software-factory-operations skill's "Self-Validation" section is
updated to clarify which tools sync (run_evaluate, run_instantiate)
and which don't (run_lint, run_parse, run_tests).
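
The sync-then-call pattern described above might look roughly like this. The `ToolResult` type, the `withWorkspaceSync` helper, and the callback signature are hypothetical; only the optional `syncWorkspace` callback on `ToolBuilderConfig` and the "typed error, no realm call on failure" behaviour come from the commit message.

```typescript
// Hypothetical result type for a factory tool call.
type ToolResult =
  | { ok: true; output: string }
  | { ok: false; error: string };

interface ToolBuilderConfig {
  // Optional, as in the commit message; the exact signature is an assumption.
  syncWorkspace?: () => Promise<void>;
}

// Wrap a realm-backed tool so the workspace is pushed to the realm first.
// If the sync throws, the realm call is skipped and a typed error returned.
function withWorkspaceSync<Args>(
  config: ToolBuilderConfig,
  realmCall: (args: Args) => Promise<ToolResult>,
): (args: Args) => Promise<ToolResult> {
  return async (args: Args): Promise<ToolResult> => {
    if (config.syncWorkspace) {
      try {
        await config.syncWorkspace();
      } catch (err) {
        const reason = err instanceof Error ? err.message : String(err);
        return { ok: false, error: `workspace sync failed: ${reason}` };
      }
    }
    return realmCall(args);
  };
}
```

Tools that read straight from the workspace would simply not be wrapped, so they pay no sync cost.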

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jurgenwerk jurgenwerk requested review from a team and habdelra May 8, 2026 12:44
@habdelra
Contributor

habdelra commented May 8, 2026

Can you do an end-to-end test with the software factory using OpenRouter? (The prompt for that is in packages/software-factory/tests/end-to-end-test-prompt.md.)

A handful of small fixes shaken out of end-to-end runs that all
point in the same direction: when the agent gets a vague error or
runs into a tool description it misreads, it goes on token-burning
fishing expeditions. Tighten the prompts and tool surface so
common confusions get answered immediately.

prompts/system.md
- Pin the workspace path concept: cwd is the local workspace
  mirror, every path passed to a tool must be workspace-relative,
  absolute `/Users/...` / `~` / app-support paths are blocked. Stops
  the agent from inventing fake macOS paths on the first turn.
- New top rule: stay in the target realm. Skills are authoritative
  for patterns; don't `boxel file ls` / `boxel search` against the
  base, software-factory, experiments, or catalog realms looking
  for examples.

prompts/ticket-implement.md
- Step 1: scope `boxel search` / `boxel read-transpiled` calls
  explicitly to the target realm.

factory-tool-builder.ts (run_instantiate description)
- Warn explicitly against passing a `Spec/...json` path or any
  card whose `meta.adoptsFrom.module` is a base-realm URL. Specs
  adopt from `https://cardstack.com/base/spec`, and the prerender
  refuses cross-origin module loads — calling run_instantiate with
  the Spec path always fails. Tell the agent to omit `path`
  instead so the tool discovers Specs and exercises their
  linkedExamples.

instantiate-execution.ts (`prepareExampleInstance`)
- Pre-flight origin check: when the resolved moduleUrl's origin
  differs from the target realm's, return a friendly error
  pointing at the correct usage instead of letting the prerender
  bubble up "moduleUrl origin (https://cardstack.com) does not
  match realmUrl origin (http://localhost:4201)".
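
A pre-flight check of this kind can be sketched with the standard `URL` class. The function name, result type, and error wording below are illustrative assumptions, not the code in `prepareExampleInstance`.

```typescript
type OriginCheck = { ok: true } | { ok: false; error: string };

// Compare scheme + host + port of the module URL and the realm URL before
// handing anything to the prerenderer.
function checkModuleOrigin(moduleUrl: string, realmUrl: string): OriginCheck {
  const moduleOrigin = new URL(moduleUrl).origin;
  const realmOrigin = new URL(realmUrl).origin;
  if (moduleOrigin === realmOrigin) {
    return { ok: true };
  }
  return {
    ok: false,
    error:
      `Card adopts from ${moduleOrigin}, but the target realm is ` +
      `${realmOrigin}. Omit \`path\` so the tool discovers Specs itself.`,
  };
}

// Usage: a Spec adopting from the base realm fails fast with a hint.
console.log(
  checkModuleOrigin(
    'https://cardstack.com/base/spec',
    'http://localhost:4201/my-realm/',
  ),
);
```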

factory-agent/opencode.ts (`summarizeSessionError`)
- Extract the response body from opencode's APIError payload. The
  HTTP statusText is usually generic ("Forbidden Request") while
  the actual reason ("Insufficient credits", model unavailable,
  upstream auth failure, ...) lives in `data.responseBody`. Parse
  the common shapes (`{errors:[...]}` from the realm server,
  `{error:{message}}` from OpenRouter) and append a `body=...`
  field to the one-line log. Falls back to a 200-char truncation
  for non-JSON bodies.
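
The body-parsing fallback chain can be sketched as below. The function name and the handling of object-valued entries in `errors` are assumptions; the two JSON shapes and the 200-character truncation are taken from the description above.

```typescript
const MAX_RAW_BODY_CHARS = 200;

// Turn one entry of an `errors` array into text. Entries may be strings or
// objects; objects with a string `message` use it, others are serialized.
function errorEntryToText(entry: unknown): string {
  if (typeof entry === 'string') return entry;
  if (entry && typeof entry === 'object') {
    const message = (entry as { message?: unknown }).message;
    if (typeof message === 'string') return message;
  }
  return JSON.stringify(entry);
}

// Pull a human-readable reason out of an HTTP error body.
function extractBodyReason(responseBody: string): string {
  try {
    const parsed: unknown = JSON.parse(responseBody);
    if (parsed && typeof parsed === 'object') {
      const body = parsed as { errors?: unknown; error?: unknown };
      // Shape 1: {errors:[...]}
      if (Array.isArray(body.errors) && body.errors.length > 0) {
        return body.errors.map(errorEntryToText).join('; ');
      }
      // Shape 2: {error:{message}}
      if (body.error && typeof body.error === 'object') {
        const message = (body.error as { message?: unknown }).message;
        if (typeof message === 'string') return message;
      }
    }
  } catch {
    // Not JSON; fall through to the raw truncation below.
  }
  return responseBody.slice(0, MAX_RAW_BODY_CHARS);
}
```

JSON bodies that match neither shape also fall back to truncation, so the log line always carries some part of the body.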

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jurgenwerk
Contributor Author

@habdelra forgot to note that, yes, I always do that when I submit PRs like these 👍 Confirmed the factory run result with both agents.

…ools

Three references in `.agents/skills/boxel-development/references/`
are pre-loaded into every implementation-issue system prompt
(`factory-skill-loader.ts:106-113`). They still talked about
factory tools that have been retired:

- `dev-spec-usage.md` told the agent to call `create_catalog_spec`
  and `write_file`. Now it tells the agent to call
  `get_card_schema({module:'https://cardstack.com/base/spec', name:'Spec'})`
  for the live schema and write the JSON natively. Added a
  complete required-shape JSON example, a reminder about the
  dotted `linkedExamples.0` key form, and an explicit "don't run
  `run_instantiate` on the Spec itself — the prerender refuses
  cross-origin module loads" note (the trap we hit in earlier runs).

- `dev-realm-search.md` framed everything around the retired
  `search_realm` tool. Rewrote the intro to point at
  `boxel search --realm <target-realm-url>` (target realm only)
  and added a reinforcing "do not query other realms" line to back
  up PR #4653's prompt-level rule. The "Discovering Available
  Fields" section's `run_command` JSON payload became a one-line
  `get_card_schema(module, name)` call.

- `dev-qunit-testing.md` had a single `read_file` mention for
  TestRun inspection — swapped for native `Read` + `Glob`.

All query-format / spec-type / QUnit-pattern content is preserved
unchanged. Only the tool names changed.

Closes CS-10520's Step 6 / Skills-aligned acceptance item via
CS-10613 Phase C step 8 (audit always-loaded references). Full
CS-10613 scope (Phase A: new boxel-api skill; Phase B: dedupe
across factory / boxel-cli / skills-realm; Phase D: loader
keyword-map updates) is follow-up work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaced while auditing CS-10883 / CS-10613 leftovers — none of the
following had a live consumer in src/ or tests/:

- `FILE_ACTION_TYPES` (factory-agent/types.ts) — declared but never
  imported.
- `FactoryAgent` interface (the original "declarative" agent shape) —
  only implementer was `MockFactoryAgent`, which itself had no callers
  outside its own definition and the barrel re-export.
- `MockFactoryAgent` class (factory-agent/mocks.ts) and its
  re-export from factory-agent/index.ts.

Removing those also lets the surrounding doc-comment in types.ts
stop referencing `factory-agent.ts` / `factory-agent-tool-use.ts`,
neither of which exist anymore.

Also touched a few historical comments that name retired tools:
- types.ts: model-pin docblock said "broke every `write_file`" — the
  pin still matters but the tool that hit the truncation is now
  native `Write`.
- types.ts: `ClaudeCodeAgentConfig.workspaceDir` mentioned realm I/O
  going through `search_realm` / `run_command` MCP tools; both are
  retired. Replaced with the current list (`get_card_schema`, the
  five validators, the control signals).
- claude-code.ts: two block comments mentioning `read_file` /
  `write_file` / `search_realm` as the "old shims" — rewrote in
  terms of the current native fs + MCP split.

Still vestigially wired (not touched here):
`VALID_ACTION_TYPES` / `VALID_REALMS` / `AgentActionType` /
`ActionRealm` / `AgentAction` are all still consumed by the iterate
prompt's `previousActions` slot. The slot is always passed `[]`
today, but unwiring it touches `factory-prompt-loader`, the
ticket-iterate.md template, and several tests — out of scope for a
small dead-code sweep. Worth its own pass later.

`pnpm test:node`: 330/330.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@lukemelia lukemelia left a comment


I notice that we refer to issues and tickets interchangeably. Should we standardize on "issues"?

@jurgenwerk
Contributor Author

@lukemelia That's fine with me. Given that we're naming it "issue tracker" in #4735, your suggestion makes sense. I'll update it.

@habdelra (despite the merge conflicts) this is ready for review again; it would be great if you could check it. I'll fix the conflicts in the meantime.

jurgenwerk and others added 17 commits May 12, 2026 09:22
…-backend-with-opencode

# Conflicts:
#	package.json
#	packages/realm-server/tests/index.ts
#	packages/software-factory/.agents/skills/software-factory-operations/SKILL.md
#	packages/software-factory/prompts/bootstrap-implement.md
#	packages/software-factory/src/factory-seed.ts
#	packages/software-factory/src/factory-tool-builder.ts
The post-merge `pnpm build:plugin` output diverged from the committed
synopsis (the `boxel realm watch` description and arg list shifted to
the new "subcommands manage watch processes" form). CI's
`Verify Boxel CLI plugin synopsis is fresh` check caught it.
Regenerated via `pnpm build:plugin`.

CI's "Verify plugin version bumped when synopsis changed" check
required a version bump alongside the regenerated
plugin/skills/realm-sync/SKILL.md from the previous commit, so
marketplace consumers see the update.

CS-10666 (under CS-10613).

`packages/boxel-cli/.agents/skills/boxel-api/SKILL.md` — new canonical
home for Boxel platform API knowledge. Covers:

- `boxel search` / `client.search()` — federated search across one or
  more realms, with the full query syntax (`type`, `eq`, `contains`,
  `range`, `every`/`any`/`not`, `sort`, `page`, CodeRef matching, common
  mistakes).
- `boxel realm create` / `client.createRealm()` — provisioning a new
  realm, including the `waitForReady` default that polls
  `/_readiness-check` for you.
- `client.waitForReady()` — standalone readiness polling.
- A "when to use what" decision matrix mapping common goals to the right
  CLI command or `BoxelCLIClient` method.
- Boundary statements pointing at sibling skills (`boxel-development`,
  `boxel-file-structure`, `boxel-sync`, `boxel-command`) so this skill
  doesn't try to cover everything.

Auth deliberately not documented. boxel-cli owns auth internally —
consumers don't see JWTs, and `BoxelCLIClient` handles tokens,
refresh, and 401 retries through `ProfileManager`. The skill only
tells consumers "use `BoxelCLIClient`; don't roll your own `fetch`."
Diverges from CS-10666's "covers auth model" acceptance line — the
auth machinery is an implementation detail, not API surface.

Retired `packages/software-factory/.agents/skills/boxel-development/
references/dev-realm-search.md` — its substantive query-syntax content
moved into the new skill. Removed from `ALWAYS_LOAD_REFERENCES` and
`REFERENCE_KEYWORD_MAP` in `factory-skill-loader.ts` so the loader no
longer tries to read a file that doesn't exist.

`pnpm test:node`: 345/345 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ev refs

Picks up where the earlier `boxel-api` skill commit left off — CS-10613's
remaining mechanical / content work.

Relocations (git mv preserves history):

- `boxel-development`, `boxel-file-structure` — into monorepo root
  `.agents/skills/`. These describe Boxel card development idioms in
  general; they aren't software-factory-specific. The factory's
  `factory-skill-loader.ts` already walks the monorepo root as a fallback
  dir, so it still picks them up.
- `boxel-sync`, `boxel-track`, `boxel-watch`, `boxel-repair`,
  `boxel-restore`, `boxel-setup` — into `packages/boxel-cli/.agents/skills/`.
  These document interactive CLI workflows for humans using Claude Code
  on a synced workspace. They aren't loaded by the factory agent
  (CLI_ONLY_SKILLS filter remains in place), so moving them out of
  `packages/software-factory/.agents/skills/` puts them next to the code
  that implements those commands.

Rewrites:

- `software-factory-operations/SKILL.md` — dropped the dual "Claude
  backend vs OpenRouter backend" branches. OpenRouter no longer exposes
  the old factory tools (read_file / write_file / search_realm /
  fetch_transpiled_module / run_command); both backends now use native
  fs (`Read`/`Write`/`Edit`/`Glob`/`Grep`) plus the `boxel` CLI through
  `Bash`. Realm-side reads section now points at the `boxel-api` skill
  for the full search query syntax and at the `boxel-command` skill for
  prerendered host commands.
- `dev-qunit-testing.md` — swap the lone `read_file` reference for
  native `Read` + `Glob`.
- `dev-spec-usage.md` — swap `create_catalog_spec` + `write_file` for
  the live `get_card_schema` introspection + native `Write` flow.
  Added the required-shape JSON example, the dotted `linkedExamples.0`
  key form (the indexer rejects the array form), and the explicit
  warning to never call `run_instantiate` on the Spec file itself
  (its module lives in the base realm, the prerender enforces
  same-origin module loads, the call always fails — a trap factory
  runs have walked into).

No code changes — the loader's existing fallback chain (primary:
`packages/software-factory/.agents/skills/`, fallback:
`MONOREPO_ROOT/.agents/skills/`) resolves moved skills correctly. Tests
in `factory-skill-loader.test.ts` (43 cases) still pass — they construct
synthetic skill dirs and don't depend on the real layout.

Closes CS-10666; advances CS-10613's content-rewrite phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ons skill

Prettier interpreted `+ \`Write\`` at the start of a continuation line
as a list bullet and reflowed the paragraph into a broken pseudo-list.
Reworded to avoid the leading `+` token.

Copilot review surfaced five issues:

1. Factory skill loader couldn't find `boxel-api` / `boxel-command`
   because `packages/boxel-cli/.agents/skills/` wasn't in its search path,
   so the cross-references from `software-factory-operations` to those
   skills were dead pointers in the factory agent's prompt.

   Fix: add `packages/boxel-cli/.agents/skills/` to the loader's fallback
   chain (between the package primary dir and the monorepo root). Also
   auto-load `boxel-api` and `boxel-command` from `DefaultSkillResolver` —
   the agent always needs the realm-search query syntax and host-command
   failure modes, so both belong in the always-loaded set.

   Side effect: `CLI_ONLY_SKILLS` is now removed. It was a defensive
   filter for the old layout where every skill lived in the factory's
   own `.agents/skills/` and could be picked up by accident. After the
   relocations the CLI skills live in `packages/boxel-cli/` and are
   never auto-loaded by the resolver — a knowledge-article author can
   explicitly opt in via a `skill:boxel-sync` tag, which is the
   deliberate path.

   Tests updated: the four "excludes CLI-only skills" cases became two
   cases — one verifying free-text keyword matching does NOT pull in CLI
   skills (still true via the resolver's hard-coded auto-load set), one
   verifying a knowledge-tag opt-in DOES include them (new behavior).

2. `software-factory-operations/SKILL.md` listed `boxel run-command`
   under "read-only `boxel` CLI commands." It dispatches to arbitrary
   host commands and isn't read-only in general — rewrote the section
   to separate read-only inspection commands from `run-command` and to
   note that its safety is "as safe as the named command."

3. Same SKILL.md called the Self-Validation section "no side effects,"
   but `run_evaluate` / `run_instantiate` / `run_tests` sync the
   workspace to the realm before invoking the prerenderer. Renamed the
   section to "in-memory results" and called out the realm push as an
   explicit side effect, while clarifying that `run_lint` / `run_parse`
   do run entirely in-process.

4. `dev-spec-usage.md` opened by saying Specs adopt from
   `https://cardstack.com/base/spec#Spec` (fragment form), then the JSON
   example used `module: "https://cardstack.com/base/spec", name: "Spec"`.
   Reworded the prose to match the module+name shape used in the example
   so a hurried reader doesn't copy `spec#Spec` into a CodeRef.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These two API surfaces aren't agent-facing — the software-factory
provisions the target realm in factory-target-realm.ts before the agent
loop starts, and waitForReady is called from the orchestrator. With
boxel-api now auto-loaded into every factory agent prompt, the
realm-creation prose was just dead weight in context.

Kept federated search (the agent does use boxel search via Bash) and
the "when to use what" table. Pointed readers needing
createRealm/waitForReady at boxel-cli/src/api.ts or boxel realm create
--help.

Six skills moved into `packages/boxel-cli/.agents/skills/` earlier in this
PR turned out to describe commands that don't exist in the monorepo's
boxel-cli:

- `boxel-sync` — no top-level `boxel sync`; only `boxel realm sync`
- `boxel-track` — no `boxel track`
- `boxel-watch` — no top-level `boxel watch`; only `boxel realm watch`
- `boxel-restore` — no `boxel restore`; closest is `boxel realm history`
- `boxel-repair` — no `boxel repair-realm` / `boxel repair-realms`
- `boxel-setup` — no `boxel setup`; setup happens via `boxel profile add`

These skills came from the standalone `cardstack/boxel-cli` GitHub repo,
which has a much richer CLI surface than the monorepo's slimmed-down
fork. The actual `boxel` here is six commands: `profile`, `file`,
`realm`, `run-command`, `search`, `read-transpiled`. Anyone reading
those skills and trying to run the documented commands would get
"unknown command" — they're worse than no docs.

Deleted the six skill directories. `boxel-api` and `boxel-command`
stay — they describe `boxel search` and `boxel run-command`, which
actually exist.

Loader cleanup that fell out:

- `SKILL_PRIORITY` no longer references the deleted skills.
- Removed the explanatory comment fragment about knowledge-tag opt-in
  for CLI skills, since CLI skills are gone.
- The "knowledge article tag opt-in" test now uses hypothetical names
  (`custom-extension`, `another-domain-skill`) instead of the deleted
  `boxel-sync` / `boxel-repair`. The test still verifies the resolver
  honors knowledge tags; it just doesn't pretend any specific CLI
  skill is a valid target anymore.

Tests: 40/40 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`packages/software-factory/.claude/CLAUDE.md` (783 lines) and
`packages/software-factory/AGENTS.md` (224 lines) both described the
standalone `cardstack/boxel-cli` GitHub repo's rich CLI surface —
`boxel sync`, `boxel track`, `boxel watch`, `boxel restore`,
`boxel repair-realm`, `boxel skills`, `boxel share`, `boxel gather`,
`boxel realms`, `boxel stop`, `boxel edit`, top-level `boxel list` /
`boxel history` / `boxel status`. None of those exist in this monorepo's
slimmed-down `boxel-cli`. The AGENTS.md also still referred to a
"dark-factory" / "guidance-tasks" workspace setup and a "one hour to
produce demo" priority that predates the current factory architecture.

Rewrote both as thin, accurate pointers (~40 lines each):

- Available commands (`pnpm factory:go`, `pnpm test:node`, `pnpm lint`).
- Pointers to README.md for architecture.
- Description of the three-directory skill loader chain.
- The architectural boundary: boxel-cli owns the entire Boxel API
  surface; the factory imports `BoxelCLIClient` and never calls fetch()
  against a realm directly.
- Key source-file map: entrypoint, issue loop, workspace-fs, agent
  backend, tool builder.

Cleanup that fell out:

- `scripts/smoke-tests/factory-skill-smoke.ts`: replaced the
  `boxel-sync` / `boxel-track` / `boxel-watch` / `boxel-restore` /
  `boxel-repair` / `boxel-setup` list with the actual currently-loadable
  skills (`boxel-api`, `boxel-command`).
- `tests/factory-skill-loader.test.ts` (budget tests): synthetic
  "lower-priority" fixture renamed from `'boxel-sync'` to
  `'low-priority-test-skill'` so it's clear the name is just a test
  placeholder, not a real skill.
- `tests/factory-agent-claude-code.test.ts` (registry-leak test):
  synthetic `'boxel-sync'` registered-tool example renamed to
  `'sample-registered-tool'` for the same reason.

`packages/software-factory/docs/phase-1-plan.md` and `phase-2-plan.md`
still mention the deleted skills — left alone because they're archived
planning docs (snapshots of intended state at a point in time, not
authoritative guidance). `tests/factory-tool-registry.test.ts` mentions
`boxel-sync` in its retired-tools list, which is correct historical
record of CS-10883's retirements.

Tests: 40/40 skill-loader, 84/84 across the three affected suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Fadhlan's review feedback on PR #4756: having two skill directories
inside boxel-cli (`.agents/skills/` and `plugin/skills/`) is redundant.
`plugin/skills/` is the canonical marketplace-distributed location used
by the boxel-cli Claude Code plugin — that's where these belong.

- `git mv` boxel-api and boxel-command into `packages/boxel-cli/plugin/skills/`.
- Drop the now-empty `packages/boxel-cli/.agents/skills/` directory.
- Update `factory-skill-loader.ts` fallback chain to point at
  `packages/boxel-cli/plugin/skills/`.
- Bump `plugin.json` version 0.1.2 → 0.1.3 (CI's synopsis-bump coupling
  check requires it whenever plugin/skills content changes).
- Update CLAUDE.md and AGENTS.md doc paths.

End users who install the boxel-cli plugin now get `boxel-api` and
`boxel-command` skills alongside the existing CLI-command skills, which
is the right semantics: both teach Boxel platform usage.

`pnpm build:plugin` does not touch these new skill directories (they're
not registered Commander commands), and `pnpm build:skills` ignores
them (not in the `boxel-skills` ALLOWLIST), so CI's
`git diff --exit-code -- plugin/skills` check still passes.

Tests: 65/65 across factory-skill-loader + factory-context-builder.

Per Fadhlan's review feedback: `boxel-file-structure` was effectively
duplicated. The version at `packages/boxel-cli/plugin/skills/` (hand-
committed in PR #4632 / CS-10900) and the one I moved to root
`.agents/skills/` in this branch were near-identical — only formatting
and frontmatter description differed.

Dropped the root copy. The factory loader's fallback chain
(software-factory primary → boxel-cli/plugin/skills → root .agents/skills)
resolves `boxel-file-structure` via plugin/skills instead. Verified
locally: `SkillLoader.load('boxel-file-structure')` returns the plugin
copy with the same content the factory was already getting.
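
A first-match lookup over that fallback chain could be sketched like this. `resolveSkillPath` and the injected `exists` predicate are hypothetical; the real `SkillLoader` presumably reads the filesystem directly, and the directory order below simply mirrors the chain named above.

```typescript
// Search order as described above: package primary, then the boxel-cli
// plugin skills, then the monorepo root.
const SKILL_SEARCH_DIRS: readonly string[] = [
  'packages/software-factory/.agents/skills',
  'packages/boxel-cli/plugin/skills',
  '.agents/skills',
];

// Return the first SKILL.md found for the skill, or undefined if no
// directory in the chain has one. `exists` is injected so the lookup can be
// exercised without touching the filesystem.
function resolveSkillPath(
  skillName: string,
  searchDirs: readonly string[],
  exists: (path: string) => boolean,
): string | undefined {
  for (const dir of searchDirs) {
    const candidate = `${dir}/${skillName}/SKILL.md`;
    if (exists(candidate)) {
      return candidate;
    }
  }
  return undefined;
}
```

Because the first hit wins, deleting the root copy of a skill changes nothing for callers as long as an earlier directory in the chain still provides it.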

Not consolidating `boxel-development` despite the surface similarity —
the two versions actually have different lineages and content:

- plugin version is auto-generated by `pnpm build:skills` from
  cardstack/boxel-skills@v0.0.22 (it's in the ALLOWLIST). Designed for
  end-user Claude Code marketplace consumers writing cards manually.
- root .agents/skills version is hand-maintained for the factory
  agent. Includes factory-specific guidance (`get_card_schema`-driven
  Spec writing in `dev-spec-usage.md`, references to factory tools),
  plus `dev-file-def.md` and `dev-qunit-testing.md` references that the
  upstream lacks.

Replacing the plugin copy with the factory version would either get
clobbered by the next build:skills run or fail CI's
`git diff --exit-code -- plugin/skills` check. The proper alignment
path is to upstream the factory-friendly changes into
cardstack/boxel-skills first; out of scope for this PR.

Tests: 40/40 factory-skill-loader pass.

Software factory: relocate skills + cleanup + rewrite for current factory tools

Per Luke's review feedback on PR #4653: "issue" and "ticket" were used
interchangeably across the factory code, prompts, and tests. The card
type itself has been `Issue` since CS-10520 Phase 2's `Ticket → Issue`
rename, so the leftover "ticket" usages were just stale prose.

Renames (`git mv`, history preserved):

- `prompts/ticket-implement.md` → `prompts/issue-implement.md`
- `prompts/ticket-iterate.md`   → `prompts/issue-iterate.md`
- `prompts/ticket-test.md`      → `prompts/issue-test.md`

Updated loader calls in `factory-prompt-loader.ts` to reference the new
names.

Prose / comment cleanups in:

- `prompts/system.md` — 5 "ticket" → "issue" replacements ("ticket
  descriptions", "the ticket requires", "When the ticket says…", "the
  ticket description", "Every ticket must include…").
- `src/factory-tool-builder.ts` — `signal_done` tool description now
  says "Signal that the current issue is complete".
- `src/factory-skill-loader.ts` — reference-keyword-map doc comment
  ("When an issue doesn't match any keyword…").
- `src/factory-context-builder.ts` — comment ("Resolve skill names
  from issue + project context").
- `src/validators/noop-step.ts` — doc comment ("via child issues").

Test fixtures (synthetic strings, but bringing them in line for
readability):

- `tests/factory-test-realm.test.ts` — `slug: 'my-ticket'` →
  `slug: 'my-issue'`, and the resulting `Validations/test_my-ticket-N`
  testRunId strings → `test_my-issue-N`.
- `tests/factory-skill-loader.test.ts` — `id: 'Knowledge Articles/
  ticket-knowledge'` → `issue-knowledge`, `skill:ticket-skill` →
  `skill:issue-skill`.

Not touched:

- `docs/phase-1-plan.md`, `docs/phase-2-plan.md` — archived planning
  docs. "Ticket" there correctly refers to the pre-rename type that
  phase 2 renamed to `Issue` (the rename itself is documented in those
  files, so changing it would erase the history).

Tests: 152/152 across factory-prompt-loader + factory-skill-loader +
factory-test-realm + factory-tool-builder pass.

The description claimed `boxel file` supports search, but it only has
delete / list / lint / read / touch / write subcommands. A factory:go
run caught the bug — the agent read the description, tried
`boxel file search ...`, and got `error: unknown command 'search'`
before falling back to other tools.

Federated search lives at top-level `boxel search`. Updated the
description to match what `file` actually does.

`pnpm build:plugin` produces no change (top-level command
descriptions aren't part of the regenerated synopsis blocks), so no
plugin.json bump.
@jurgenwerk jurgenwerk merged commit 0af717a into main May 12, 2026
77 checks passed
4 participants