Voice provider abstraction: expose standalone STT + TTS endpoints #29

## Goal

Extend the #692 VoiceTransportProvider abstraction so each backend exposes standalone **speech-to-text** (transcription) and **text-to-speech** (synthesis) endpoints, alongside its existing realtime conversational session. **Invisible infrastructure: no UI change, nothing removed.** Unblocks the per-session voice swap (#27); independently shippable and a clean standalone upstream PR.

## Spec

- `docs/plans/2026-06-03-overseer-build-sequence.md` Step 5 (the per-session reshaping this enables)
- Upstream alignment: tiann in discussion #691 — lightweight per-session voice = STT to compose prompts + TTS to read summaries. Builds on the landed provider abstraction (#692).

## Provider capability (verified 2026-06-04)

All three current backends expose standalone STT and TTS, decoupled from their realtime conversational layer:

- **ElevenLabs**: `POST /v1/speech-to-text` (Scribe v1/v2, batch + realtime WS) + `POST /v1/text-to-speech/{voice_id}`
- **Gemini**: audio understanding / Cloud Speech-to-Text (Chirp) + Gemini-TTS / Cloud TTS
- **Qwen / DashScope**: Paraformer / Qwen3-ASR / SenseVoice (ASR) + CosyVoice / Qwen-TTS

Endpoint *shapes* differ per backend (e.g. Gemini STT means sending audio to `generateContent` rather than calling a dedicated transcription service), so the abstraction needs a small per-backend capability shim. This is plumbing, not a blocker.
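To illustrate the shape difference, here is a minimal sketch of what such a shim could look like. Every type and function name (`SpeechShim`, `RequestShape`, `describe`) is hypothetical and not taken from #692. The ElevenLabs paths and the `generateContent` operation come from the list above; the assumption that Gemini TTS also goes through a model call is mine and should be verified. Qwen is left out because its endpoint shape is not spelled out here.

```typescript
// Hypothetical names throughout; nothing here mirrors the actual #692 types.
type BackendId = "elevenlabs" | "gemini";

// The two request shapes mentioned in the capability notes:
// a dedicated REST endpoint, or a call into a general model operation.
type RequestShape =
  | { kind: "dedicated-endpoint"; method: "POST"; path: string }
  | { kind: "model-call"; operation: string };

// One interface that hides which shape a backend uses.
interface SpeechShim {
  readonly backend: BackendId;
  sttRequest(): RequestShape;
  ttsRequest(voiceId: string): RequestShape;
}

const elevenLabsShim: SpeechShim = {
  backend: "elevenlabs",
  sttRequest: () => ({
    kind: "dedicated-endpoint",
    method: "POST",
    path: "/v1/speech-to-text",
  }),
  ttsRequest: (voiceId) => ({
    kind: "dedicated-endpoint",
    method: "POST",
    path: `/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
  }),
};

const geminiShim: SpeechShim = {
  backend: "gemini",
  // Per the note above: audio is sent to generateContent.
  sttRequest: () => ({ kind: "model-call", operation: "generateContent" }),
  // Assumption (not stated in this issue): TTS is also a model call.
  ttsRequest: (_voiceId) => ({ kind: "model-call", operation: "generateContent" }),
};

// Callers only see RequestShape, never backend-specific details.
function describe(shape: RequestShape): string {
  return shape.kind === "dedicated-endpoint"
    ? `${shape.method} ${shape.path}`
    : `model call: ${shape.operation}`;
}
```

A caller would pick a shim by backend and never branch on the provider itself, e.g. `describe(elevenLabsShim.sttRequest())`.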

## Acceptance

- [ ] VoiceTransportProvider exposes an STT/transcription capability per backend (verify what #692 already exposes; add only what's missing).
- [ ] VoiceTransportProvider exposes a TTS/synthesis capability per backend.
- [ ] Both resolve to the operator's currently-chosen voice service (settings); no cross-provider mixing.
- [ ] A per-backend capability shim normalizes the differing endpoint shapes behind one interface.
- [ ] No user-facing UI change and nothing removed in this issue — purely additive infra.
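The "no cross-provider mixing" criterion could be enforced at resolution time. The sketch below is one possible approach, not the planned implementation: `VoiceSettings`, `SpeechPair`, and `resolveSpeech` are hypothetical names, and the settings field `voiceService` is an assumption about how the operator's choice is stored.

```typescript
// Hypothetical names; the real settings shape may differ.
type BackendId = "elevenlabs" | "gemini" | "qwen";

interface VoiceSettings {
  voiceService: BackendId; // assumed: the operator's currently chosen service
}

interface SttCapability {
  backend: BackendId;
  transcribe(audio: Uint8Array): Promise<string>;
}

interface TtsCapability {
  backend: BackendId;
  synthesize(text: string): Promise<Uint8Array>;
}

interface SpeechPair {
  stt: SttCapability;
  tts: TtsCapability;
}

// Resolve both capabilities from the single chosen backend and
// refuse any pair whose halves come from different providers.
function resolveSpeech(
  settings: VoiceSettings,
  registry: Record<BackendId, SpeechPair>,
): SpeechPair {
  const pair = registry[settings.voiceService];
  const chosen = settings.voiceService;
  if (pair.stt.backend !== chosen || pair.tts.backend !== chosen) {
    throw new Error(`cross-provider mixing detected for "${chosen}"`);
  }
  return pair;
}

// Helper for building placeholder pairs; real shims would call the provider.
function stubPair(sttBackend: BackendId, ttsBackend: BackendId): SpeechPair {
  return {
    stt: { backend: sttBackend, transcribe: async () => "" },
    tts: { backend: ttsBackend, synthesize: async () => new Uint8Array() },
  };
}
```

Because both halves are looked up through one key, switching the setting swaps STT and TTS together.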

## Out of scope

- Any per-session UI change (composer mic, read-back, removing the conversational button) — that is the atomic switch in #27.
- The Overseer / chrome conversational surface (#27, #25).
- Events substrate (#22-#26) — independent track.

## Dependencies

- Blocked by: #692 (provider abstraction)
- Blocks: #27 (the per-session swap consumes these endpoints)
- Relates to: #21 (sibling on the parallel transport track)
- Part of: #19

## Suggested PR breakdown

1 PR: STT + TTS capability exposure on the abstraction + per-backend shim.

## Risks

- Endpoint-shape divergence across providers (handled by the shim). The earlier worry that a provider might lack a standalone endpoint is **theoretical for the current set**, since all three expose both. The only real fallback concern is unknown future backends, so design the shim with a capability flag and graceful disable.
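A capability flag with graceful disable might look like the following sketch. All names (`SpeechBackend`, `Capability`, `sttCapability`, `ttsCapability`) are hypothetical. The idea is that a missing capability yields a disabled result with a reason, so callers can hide or skip the feature instead of failing.

```typescript
// Hypothetical names; a sketch of "capability flag + graceful disable".
type Transcribe = (audio: Uint8Array) => Promise<string>;
type Synthesize = (text: string) => Promise<Uint8Array>;

interface SpeechBackend {
  id: string;
  // Optional: a future backend may lack one or both standalone endpoints.
  transcribe?: Transcribe;
  synthesize?: Synthesize;
}

// Either an enabled capability with its callable, or a disabled one with a reason.
type Capability<T> =
  | { enabled: true; call: T }
  | { enabled: false; reason: string };

function sttCapability(backend: SpeechBackend): Capability<Transcribe> {
  return backend.transcribe
    ? { enabled: true, call: backend.transcribe }
    : { enabled: false, reason: `${backend.id} has no standalone STT` };
}

function ttsCapability(backend: SpeechBackend): Capability<Synthesize> {
  return backend.synthesize
    ? { enabled: true, call: backend.synthesize }
    : { enabled: false, reason: `${backend.id} has no standalone TTS` };
}

// Example: an imagined future backend that only offers synthesis.
const ttsOnlyBackend: SpeechBackend = {
  id: "future-backend",
  synthesize: async () => new Uint8Array(),
};
```

A consumer such as the per-session swap could check `enabled` before offering dictation or read-back, and surface `reason` in logs when a capability is off.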

## Notes

Deliberately decoupled from the visible per-session change: this lands early and invisibly, while the conversational-button → dictate + read-back swap happens **atomically** at the chrome-voice switch (#27). That ordering means there's never an interim period with two ways to talk inside a session.
