Goal
Extend the tiann#692 VoiceTransportProvider abstraction so each backend exposes standalone speech-to-text (transcription) and text-to-speech (synthesis) endpoints, alongside its existing realtime conversational session. Invisible infrastructure: no UI change, nothing removed. Unblocks the per-session voice swap (#27); independently shippable and a clean standalone upstream PR.
Spec
Provider capability (verified 2026-06-04)
All three current backends expose standalone STT and TTS, decoupled from their realtime conversational layer:
- ElevenLabs: POST /v1/speech-to-text (Scribe v1/v2, batch + realtime WS) + POST /v1/text-to-speech/{voice_id}
- Gemini: audio understanding / Cloud Speech-to-Text (Chirp) + Gemini-TTS / Cloud TTS
- Qwen / DashScope: Paraformer / Qwen3-ASR / SenseVoice (ASR) + CosyVoice / Qwen-TTS
Endpoint shapes differ per backend (e.g. Gemini STT = send audio to generateContent, not a dedicated transcription service), so the abstraction needs a small per-backend capability shim — plumbing, not a blocker.
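A minimal sketch of what such a per-backend shim could look like, assuming a TypeScript codebase. Every type and function name here (`SpeechShim`, `SttShape`, `describeStt`) is hypothetical and not taken from tiann#692; only the ElevenLabs paths and the Gemini `generateContent` route come from the list above. The point is only that the STT shape is modelled as data, so a dedicated route and a model call can sit behind one interface.

```typescript
// Hypothetical shapes for illustration; none of these names come from tiann#692.
type Backend = "elevenlabs" | "gemini" | "qwen";

// How a backend's standalone STT is reached: a dedicated transcription
// route, or audio passed to a general-purpose model call.
type SttShape =
  | { kind: "dedicated"; method: "POST"; path: string }
  | { kind: "model-call"; operation: string };

interface SpeechShim {
  backend: Backend;
  stt: SttShape;
  // Returns the TTS route for a voice, or null when this sketch has no known path.
  ttsPath(voiceId: string): string | null;
}

const elevenLabsShim: SpeechShim = {
  backend: "elevenlabs",
  stt: { kind: "dedicated", method: "POST", path: "/v1/speech-to-text" },
  ttsPath: (voiceId) => `/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
};

const geminiShim: SpeechShim = {
  backend: "gemini",
  stt: { kind: "model-call", operation: "generateContent" },
  ttsPath: () => null, // TTS route is not specified in this document
};

// Summarises the STT shape; a real shim would build the request here instead.
function describeStt(shim: SpeechShim): string {
  const stt = shim.stt;
  if (stt.kind === "dedicated") {
    return `${shim.backend}: ${stt.method} ${stt.path}`;
  }
  return `${shim.backend}: audio sent via ${stt.operation}`;
}

console.log(describeStt(elevenLabsShim));
console.log(describeStt(geminiShim));
```

Keeping the divergence in a discriminated union means callers branch on `kind` in one place rather than on the backend name throughout the code.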
Acceptance
Out of scope
Dependencies
Suggested PR breakdown
1 PR: STT + TTS capability exposure on the abstraction + per-backend shim.
Risks
- Endpoint-shape divergence across providers (handled by the shim). The earlier worry that a provider might lack a standalone endpoint is theoretical for the current set, since all three expose both. The only real fallback concern is unknown future backends, so design the shim with a capability flag + graceful disable.
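One way the capability flag and graceful disable could be expressed, as a sketch only. The names (`SpeechCapabilities`, `SpeechProvider`, `sttAvailable`, `transcribeIfSupported`) and the two stub providers are hypothetical and exist purely for illustration; the actual VoiceTransportProvider interface may look quite different.

```typescript
// Hypothetical names throughout; a sketch of "capability flag + graceful disable".
interface SpeechCapabilities {
  stt: boolean;
  tts: boolean;
}

interface SpeechProvider {
  name: string;
  capabilities: SpeechCapabilities;
  // Optional: a future backend may not implement either method.
  transcribe?: (audio: Uint8Array) => Promise<string>;
  synthesize?: (text: string) => Promise<Uint8Array>;
}

type SttResult =
  | { status: "ok"; text: string }
  | { status: "unsupported"; provider: string };

// True only when the flag is set AND the method is actually present.
function sttAvailable(provider: SpeechProvider): boolean {
  return provider.capabilities.stt && typeof provider.transcribe === "function";
}

// Returns an "unsupported" result instead of throwing, so callers can
// disable dictation for that backend without special-casing errors.
async function transcribeIfSupported(
  provider: SpeechProvider,
  audio: Uint8Array,
): Promise<SttResult> {
  if (!provider.capabilities.stt || !provider.transcribe) {
    return { status: "unsupported", provider: provider.name };
  }
  return { status: "ok", text: await provider.transcribe(audio) };
}

// Stub providers, standing in for a current backend and an unknown future one.
const stubWithStt: SpeechProvider = {
  name: "stub-with-stt",
  capabilities: { stt: true, tts: false },
  transcribe: async (audio) => `${audio.length} bytes transcribed`,
};

const futureBackend: SpeechProvider = {
  name: "future-backend",
  capabilities: { stt: false, tts: false },
};
```

Since this change is meant to ship with no UI change, returning a typed "unsupported" result (rather than throwing) leaves the decision about what to hide or disable to whichever later change consumes it.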
Notes
Deliberately decoupled from the visible per-session change: this lands early and invisibly, while the conversational-button → dictate + read-back swap happens atomically at the chrome-voice switch (#27). That ordering means there's never an interim period with two ways to talk inside a session.
Spec reference: docs/plans/2026-06-03-overseer-build-sequence.md, Step 5 (the per-session reshaping this enables)