Add terminal-product real-agent lab parity so Image Skill can migrate from its bespoke sim

## Summary

Image Skill cannot yet replace its bespoke real-agent sim with Mimetic. Mimetic has strong generic Observer/browser/lab primitives, but it does not yet provide a terminal-product real-agent study lane with the proof artifacts and safety contract Image Skill needs.

This issue should give a fresh Codex session enough detail to implement the missing adapter/lab path without reading any private context. Treat this as public adopter feedback from attempting to use `mimetic-cli@0.7.0` as a third-party OSS package.

## Public black-box evaluation notes

Evaluated as an outside npm package, without inspecting private downstream source:

- `mimetic-cli@0.7.0` is published on npm and exposes the `mimetic` binary.
- `mimetic init --yes` works in a clean disposable project and scaffolds:
  - committed source plane: `mimetic/`
  - ignored runtime plane: `.mimetic/`
- `mimetic run --dry-run --json` produces a verified contract bundle.
- `mimetic lab run cua-browser --dry-run --json --no-open` produces a contract bundle and honestly labels it as contract-only.
- `mimetic run --app-url http://127.0.0.1:<live-port> --json` can produce a passing live browser bundle with desktop/mobile screenshots and traces; `mimetic verify --run <run> --json` passes.
- `mimetic feedback issue --run latest --repo owner/repo --format markdown` produces a public-safe issue draft without GitHub mutation.
- `mimetic codex app-server --help` shows a Codex app-server actor surface, but that is not the same as the Image Skill real-agent sim lane.

One concrete verifier gap found during black-box testing:

- A blocked live browser run against an unavailable loopback target failed closed, which is good, but its bundle referenced screenshot paths that were not written, causing `mimetic verify` to fail on missing local evidence artifacts. For failed/blocked runs, Mimetic should either persist placeholder screenshots or omit screenshot artifact references so failed runs remain structurally verifiable evidence.
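
The "omit the reference" option could look roughly like the sketch below. It is only an illustration: the `Bundle` and `ArtifactRef` shapes, the `omittedArtifacts` field, and the `finalizeArtifactRefs` name are hypothetical and were not observed in Mimetic's bundle format. The idea is to drop references to files that were never written and to record each omission explicitly, so a blocked run stays structurally verifiable without hiding what went missing.

```typescript
// Hypothetical shapes; Mimetic's real bundle format may differ.
interface ArtifactRef {
  kind: string; // e.g. "screenshot" or "trace"
  path: string; // path relative to the run bundle root
}

interface OmittedArtifact extends ArtifactRef {
  reason: string;
}

interface Bundle {
  status: "passed" | "failed" | "blocked";
  artifacts: ArtifactRef[];
  omittedArtifacts: OmittedArtifact[];
}

// Keep only references whose files were actually written. Everything else is
// moved to an explicit omission list instead of being silently dropped.
function finalizeArtifactRefs(
  bundle: Bundle,
  exists: (path: string) => boolean,
): Bundle {
  const kept: ArtifactRef[] = [];
  const omitted: OmittedArtifact[] = [...bundle.omittedArtifacts];
  for (const ref of bundle.artifacts) {
    if (exists(ref.path)) {
      kept.push(ref);
    } else {
      omitted.push({ kind: ref.kind, path: ref.path, reason: "not-written" });
    }
  }
  return { ...bundle, artifacts: kept, omittedArtifacts: omitted };
}
```

In a real implementation `exists` would presumably wrap a filesystem check against the bundle directory; it is injected here so the rule can be tested deterministically.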

## Blocking Image Skill use case

Image Skill's current bespoke sim proves a real autonomous agent study, not just browser UI proof:

```text
real Codex agent runtime + E2B substrate + vague product task + public Image Skill surfaces
-> command-scoped runtime auth
-> PTY terminal stream in Observer
-> no private repo access
-> no direct creative-provider/payment/deploy/GitHub/database credentials
-> capped no-spend product attempt
-> durable terminal/substrate/cost/product proof bundle
-> verified public-safe review artifacts
```

The important distinction: Image Skill is testing whether an autonomous agent can discover and use a CLI/product surface from public materials. It is not merely testing whether a browser can click a local web app.

## Required new capability

Add a generic Mimetic lane for **terminal-product real-agent studies**. Suggested names are flexible, but the missing concept is:

- subject type: `terminal-product`, `cli-product`, or similar
- actor type: `codex-exec` or generic terminal agent runtime
- substrate: E2B shell/terminal with PTY capture
- observer stream kind: terminal, with live PTY tail and immutable replay data
- product adapter hooks: product-specific scoring/feedback/no-spend proof without forking Mimetic's core

A public-safe example lab should be possible along these lines:

```yaml
schema: mimetic.lab.v2
id: image-skill-fresh-agent-discovery
title: Image Skill fresh-agent discovery
subject:
  source: terminal-product
  product:
    name: image-skill
    publicSurfaces:
      - https://image-skill.com
      - https://image-skill.com/llms.txt
      - https://image-skill.com/skill.md
      - npm:image-skill
actors:
  - type: codex-exec
    persona: autonomous-creative-agent
    mission: >-
      You are an autonomous creative agent. Discover Image Skill from public
      surfaces and determine whether it can help with a durable image creation
      task. Stay within the declared no-spend caps. Leave feedback if the
      workflow is confusing.
execution:
  target: e2b-terminal
  timeoutMs: 600000
  terminal:
    transport: pty
    stdin: disabled
  runtimeAuth: openai-env
scenario:
  mode: live
  caps:
    maxUsd: 0
    maxJobs: 0
    maxMinutes: 10
policies:
  allowPrivateRepoAccess: false
  allowProviderCredentials: false
  allowPaymentCredentials: false
  allowGitHubMutation: false
```

The exact schema can differ. The acceptance requirement is parity with the behavior and proof listed below.

## Safety and credential boundary requirements

The lane must enforce these properties by construction and by verifier checks:

- The study agent sees only public product surfaces and the prompt/mission.
- The agent must not clone or inspect the downstream private/source repo.
- Operator credentials may be loaded locally, but secret values must never persist into the run bundle, Observer data, command logs, or transcripts.
- E2B auth is operator-side only.
- LLM runtime auth such as `OPENAI_API_KEY`/`CODEX_API_KEY` is injected only into the command-scoped Codex process, not into sandbox global env, prompts, metadata, transcripts, or artifacts.
- Direct media provider keys, payment provider keys, deploy tokens, database URLs, GitHub write tokens, and private product credentials must be excluded from sandbox env and artifacts.
- E2B sandbox metadata must use a positive allowlist and contain no prompts, tokens, user data, or secret values.
- Operator stdin must be disabled by default. If assisted input is ever enabled, every input event must be persisted as an intervention and the run must be marked non-comparable to unassisted runs.
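
A minimal sketch of two of these boundaries, command-scoped runtime auth and positive-allowlist sandbox metadata, is shown below. All names (`scopeRuntimeAuth`, `allowlistMetadata`, the allowlisted metadata keys) are assumptions made for illustration; this is neither Mimetic's nor E2B's actual API.

```typescript
// Hypothetical names throughout; not Mimetic's or E2B's actual API.
const RUNTIME_AUTH_KEYS = ["OPENAI_API_KEY", "CODEX_API_KEY"] as const;
const METADATA_ALLOWLIST = new Set<string>(["runId", "labId", "scenarioMode"]);

type OperatorEnv = Record<string, string | undefined>;

interface ScopedEnv {
  sandboxEnv: Record<string, string>; // global sandbox env: no secrets
  commandEnv: Record<string, string>; // passed only to the single Codex command
}

// Split the operator env so LLM runtime auth reaches only the Codex process.
// Every other operator variable (E2B auth, provider keys, database URLs) is
// dropped, because nothing is forwarded to the sandbox by default.
function scopeRuntimeAuth(operatorEnv: OperatorEnv): ScopedEnv {
  const commandEnv: Record<string, string> = {};
  for (const key of RUNTIME_AUTH_KEYS) {
    const value = operatorEnv[key];
    if (value) commandEnv[key] = value;
  }
  return { sandboxEnv: {}, commandEnv };
}

// Positive allowlist: unknown metadata keys are removed rather than scanned.
function allowlistMetadata(input: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of Object.keys(input)) {
    if (METADATA_ALLOWLIST.has(key)) out[key] = input[key];
  }
  return out;
}
```

An allowlist is used rather than a denylist so that a newly introduced metadata field is excluded until someone deliberately permits it.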

## Artifact parity needed

Mimetic does not need to copy Image Skill's filenames exactly, but it needs equivalent durable evidence and verifier coverage.

Required evidence categories:

- run identity and scenario metadata
- prompt/mission actually given to the agent
- actor runtime metadata and command line, redacted where needed
- substrate lifecycle ledger: create, readiness, execution, cleanup
- command log ledger
- PTY terminal raw event stream
- normalized terminal transcript
- operator interventions ledger, even if empty
- agent transcript/report summary
- cost/spend ledger with unknowns represented as `null`, not guessed
- no-spend proof: product credits/jobs/payment/provider/media spend stayed at zero for no-spend scenarios
- product feedback proof or feedback candidate lifecycle
- cleanup proof for the sandbox
- redaction/leak scan over public-bound text artifacts
- Observer data for both live watch and immutable replay
- verification result that fails closed when required evidence is missing
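
One way to read "fails closed" is sketched below, under the assumption that the verifier works from a fixed list of required evidence categories. The category identifiers and the `verifyEvidence` function are hypothetical labels that mirror the list above, not Mimetic's real schema. Anything not explicitly marked present counts as missing, and an empty interventions ledger still counts as present because the ledger itself exists.

```typescript
// Hypothetical category names mirroring the list above; not Mimetic's schema.
const REQUIRED_EVIDENCE = [
  "run-identity",
  "mission",
  "actor-runtime",
  "substrate-lifecycle",
  "command-log",
  "pty-raw-events",
  "terminal-transcript",
  "interventions",
  "agent-report",
  "cost-ledger",
  "no-spend-proof",
  "feedback-proof",
  "cleanup-proof",
  "leak-scan",
  "observer-data",
] as const;

type EvidenceCategory = (typeof REQUIRED_EVIDENCE)[number];

interface VerifyResult {
  ok: boolean;
  missing: EvidenceCategory[];
}

// `present` maps a category to whether its artifact was found and parsed.
// Absent or false entries are treated as missing, so the default is failure.
function verifyEvidence(
  present: Partial<Record<EvidenceCategory, boolean>>,
): VerifyResult {
  const missing = REQUIRED_EVIDENCE.filter((category) => present[category] !== true);
  return { ok: missing.length === 0, missing };
}
```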

For Image Skill specifically, the adapter must be able to record product-specific concepts without making them Mimetic core primitives:

- public CLI/product command observed
- hosted product success or blocker observed
- feedback id or feedback draft observed
- media/job/asset ids when present
- explicit no-media/no-provider-spend proof for no-spend studies
- defection/friction risk summary
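
As a rough illustration of how these concepts could stay out of Mimetic core, the adapter might be a small interface that receives generic terminal observations and returns product-specific findings. Everything below is a hypothetical sketch: the `ProductAdapter` contract, the field names, and the assumption that the CLI is invoked as `image-skill` are illustrative only.

```typescript
// Hypothetical adapter contract; field and method names are illustrative.
interface TerminalObservation {
  commands: string[]; // commands seen in the normalized transcript
  transcript: string; // normalized terminal transcript text
}

interface ProductFindings {
  product: string;
  // Product-specific facts stay opaque to core; null means "not observed".
  facts: Record<string, string | number | boolean | null>;
  risks: string[];
}

interface ProductAdapter {
  name: string;
  analyze(observation: TerminalObservation): ProductFindings;
}

// Assumes the public CLI is invoked as `image-skill`; adjust if it is not.
const imageSkillAdapter: ProductAdapter = {
  name: "image-skill",
  analyze(observation: TerminalObservation): ProductFindings {
    const cliObserved = observation.commands.some((command) =>
      /\bimage-skill\b/.test(command),
    );
    return {
      product: "image-skill",
      facts: {
        publicCliObserved: cliObserved,
        feedbackId: null, // stays null until a feedback id is actually seen
      },
      risks: cliObserved ? [] : ["agent never invoked the public CLI"],
    };
  },
};
```

Core would only need to persist `ProductFindings` alongside the bundle and let the verifier treat it as opaque product evidence.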

## Acceptance criteria

A fresh Codex session should be able to close this issue by implementing and proving the following:

1. Add a terminal-product lab route that can run a real `codex-exec` style actor in an E2B terminal/shell with PTY capture.
2. Add config parsing and docs for the terminal-product subject/lab shape.
3. Add a fixture or example lab that is public-safe and does not require private repo access. It can target a mock CLI product in CI and document an Image Skill public-surface example for live operator runs.
4. Add verifier checks for terminal/substrate/cost/no-spend/cleanup/intervention evidence.
5. Add Observer support for live and immutable terminal-product streams.
6. Ensure blocked/failed terminal-product runs still produce structurally verifiable bundles when the failure itself is the evidence.
7. Preserve public-safety defaults: no GitHub mutation, no spend, no provider/payment/deploy credentials, no secret values in artifacts.
8. Add product-adapter hooks so Image Skill can attach product-specific scoring and feedback lifecycle without forking Mimetic core.
9. Fix the existing blocked-browser missing-screenshot reference issue, or resolve it by applying the same artifact-reference discipline that the new verifier work introduces.
10. Include deterministic tests plus at least one live/key-gated proof receipt for the E2B terminal route.

## Suggested proof commands

Minimum deterministic proof:

```bash
pnpm check
pnpm mimetic -- lab run <terminal-product-fixture> --dry-run --json --no-open
pnpm mimetic -- verify --run latest --json
pnpm mimetic -- watch --run latest --detach --no-open --json
```

Key-gated live proof, using only explicit operator env:

```bash
pnpm mimetic -- lab run <terminal-product-live-fixture> \
  --env-file .mimetic/local/provider.env \
  --json \
  --no-open
pnpm mimetic -- verify --run latest --json
pnpm mimetic -- feedback issue --run latest --repo danielgwilson/mimetic-cli --format markdown
```

Expected live proof properties:

- real E2B sandbox created and cleaned up
- real Codex runtime invoked
- PTY terminal stream persisted
- an equivalent of `terminal/interventions` exists and is empty
- runtime auth is command-scoped
- no direct provider/payment/deploy/GitHub/database credentials appear in sandbox env or artifacts
- no product/media/payment spend for no-spend scenario
- Observer renders the terminal stream and artifact panel
- `mimetic verify` returns `ok: true`
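
The no-spend property depends on the earlier rule that unknown costs are recorded as `null` rather than guessed. A verifier check could look something like the sketch below; the `SpendLedger` fields and the three-way verdict are assumptions for illustration, not an existing Mimetic type. An unknown value is never treated as zero, so a ledger with gaps cannot count as proof.

```typescript
// Hypothetical ledger shape; null means "unknown", never "zero".
interface SpendLedger {
  productCredits: number | null;
  jobs: number | null;
  paymentUsd: number | null;
  providerUsd: number | null;
}

type NoSpendVerdict = "proven" | "violated" | "unknown";

// Any positive value is a violation. Any unknown value blocks a "proven"
// verdict, so a no-spend scenario only passes when every entry is a known zero.
function evaluateNoSpend(ledger: SpendLedger): NoSpendVerdict {
  const values = [
    ledger.productCredits,
    ledger.jobs,
    ledger.paymentUsd,
    ledger.providerUsd,
  ];
  if (values.some((value) => value !== null && value > 0)) return "violated";
  if (values.some((value) => value === null)) return "unknown";
  return "proven";
}
```

A fail-closed verifier would presumably accept only `"proven"` for no-spend scenarios and report `"unknown"` as missing evidence rather than as success.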

## Non-goals

- Do not add live payment execution.
- Do not require downstream repos to expose private source code to the study agent.
- Do not make Image Skill-specific nouns hard-coded into Mimetic core.
- Do not claim browser/app proof is equivalent to autonomous CLI product adoption proof.
- Do not mutate GitHub from the feedback command by default.

## Why this matters

Mimetic is already close to being the shared harness substrate. The missing piece is the real-agent terminal-product proof lane. Once this lands, Image Skill should be able to migrate from bespoke sim/Observer infrastructure to a Mimetic lab plus a thin Image Skill adapter, while preserving the proof quality that currently blocks replacement.
