Voice provider abstraction: expose standalone STT + TTS endpoints #29

@heavygee

Goal

Extend the tiann/hapi#692 VoiceTransportProvider abstraction so that each backend exposes standalone speech-to-text (transcription) and text-to-speech (synthesis) endpoints alongside its existing realtime conversational session. This is invisible infrastructure: no UI change, nothing removed. It unblocks the per-session voice swap (#27), is independently shippable, and makes a clean standalone upstream PR.

Spec

Provider capability (verified 2026-06-04)

All three current backends expose standalone STT and TTS, decoupled from their realtime conversational layer:

  • ElevenLabs: POST /v1/speech-to-text (Scribe v1/v2, batch + realtime WS) + POST /v1/text-to-speech/{voice_id}
  • Gemini: audio understanding / Cloud Speech-to-Text (Chirp) + Gemini-TTS / Cloud TTS
  • Qwen / DashScope: Paraformer / Qwen3-ASR / SenseVoice (ASR) + CosyVoice / Qwen-TTS

Endpoint shapes differ per backend (e.g. Gemini STT = send audio to generateContent, not a dedicated transcription service), so the abstraction needs a small per-backend capability shim — plumbing, not a blocker.
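
To make the shape divergence concrete, here is a minimal sketch of how a shim might describe each backend's STT request. Every type and function name (`BackendId`, `SttRequestShape`, `describeSttRequest`, `elevenLabsTtsPath`) is hypothetical and not taken from tiann/hapi#692. Only the ElevenLabs paths and the Gemini `generateContent` operation come from the list above. The Qwen / DashScope shape is left unspecified because this issue does not state it.

```typescript
// Hypothetical sketch: none of these names exist in tiann/hapi#692.
type BackendId = "elevenlabs" | "gemini" | "qwen";

interface SttRequestShape {
  backend: BackendId;
  // "dedicated": a purpose-built transcription endpoint.
  // "generate-content": audio is sent to a general model call.
  style: "dedicated" | "generate-content";
  // Path or operation name as given in this issue.
  target: string;
}

// Describe how a standalone STT request is shaped for a backend.
// Returns undefined where this issue does not specify the shape.
function describeSttRequest(backend: BackendId): SttRequestShape | undefined {
  switch (backend) {
    case "elevenlabs":
      return { backend, style: "dedicated", target: "/v1/speech-to-text" };
    case "gemini":
      return { backend, style: "generate-content", target: "generateContent" };
    case "qwen":
      return undefined; // shape not specified in this issue
  }
}

// ElevenLabs TTS takes the voice id as a path segment.
function elevenLabsTtsPath(voiceId: string): string {
  return `/v1/text-to-speech/${encodeURIComponent(voiceId)}`;
}
```

Callers would only ever see the normalized description; the per-backend branch stays inside the shim.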

Acceptance

  • VoiceTransportProvider exposes an STT/transcription capability per backend (verify what tiann/hapi#692, "feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime", already exposes; add only what's missing).
  • VoiceTransportProvider exposes a TTS/synthesis capability per backend.
  • Both resolve to the operator's currently-chosen voice service (settings); no cross-provider mixing.
  • A per-backend capability shim normalizes the differing endpoint shapes behind one interface.
  • No user-facing UI change and nothing removed in this issue — purely additive infra.
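
One way the acceptance criteria could fit together, as a sketch only: a single interface carries both capabilities, and one lookup keyed on the operator's setting returns the shim for both STT and TTS, so the two cannot come from different providers. All names here (`StandaloneVoice`, `VoiceSettings`, `resolveStandaloneVoice`, `makeFakeShim`) are assumptions for illustration, and the fake shims stand in for real network calls.

```typescript
// Hypothetical sketch: the real VoiceTransportProvider surface may differ.
type BackendId = "elevenlabs" | "gemini" | "qwen";

// One interface that every per-backend shim implements.
interface StandaloneVoice {
  backend: BackendId;
  transcribe(audio: Uint8Array): Promise<string>;
  synthesize(text: string): Promise<Uint8Array>;
}

// Assumed settings shape: the operator picks exactly one voice service.
interface VoiceSettings {
  backend: BackendId;
}

// A single lookup serves both STT and TTS, which rules out
// cross-provider mixing by construction.
function resolveStandaloneVoice(
  settings: VoiceSettings,
  shims: Record<BackendId, StandaloneVoice>,
): StandaloneVoice {
  return shims[settings.backend];
}

// Placeholder shim for illustration; it performs no network I/O.
function makeFakeShim(backend: BackendId): StandaloneVoice {
  return {
    backend,
    transcribe: (_audio) => Promise.resolve(`[${backend} transcript]`),
    synthesize: (_text) => Promise.resolve(new Uint8Array(0)),
  };
}

const shims: Record<BackendId, StandaloneVoice> = {
  elevenlabs: makeFakeShim("elevenlabs"),
  gemini: makeFakeShim("gemini"),
  qwen: makeFakeShim("qwen"),
};

const voice = resolveStandaloneVoice({ backend: "gemini" }, shims);
```

Because nothing here touches the UI or the realtime session, a change of this shape stays purely additive.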

Out of scope

Dependencies

Suggested PR breakdown

1 PR: STT + TTS capability exposure on the abstraction + per-backend shim.

Risks

  • Endpoint-shape divergence across providers (handled by the shim). The earlier "a provider lacks a standalone endpoint" worry is theoretical for the current set — all three expose both — so the only real fallback concern is unknown future backends; design the shim with a capability flag + graceful disable.
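
The capability flag plus graceful disable could look roughly like the sketch below. The names (`VoiceCapabilities`, `BackendShim`, `enabledFeatures`) and the `future-backend` entry are hypothetical. The idea is that a feature is offered only when the shim both declares the capability and implements it, so an unknown future backend that lacks one endpoint simply turns that feature off instead of failing.

```typescript
// Hypothetical sketch of a capability flag with graceful disable.
interface VoiceCapabilities {
  stt: boolean;
  tts: boolean;
}

interface BackendShim {
  id: string;
  capabilities: VoiceCapabilities;
  // Optional: a backend may lack one of the standalone endpoints.
  transcribe?: (audio: Uint8Array) => Promise<string>;
  synthesize?: (text: string) => Promise<Uint8Array>;
}

interface EnabledFeatures {
  dictate: boolean;  // needs standalone STT
  readBack: boolean; // needs standalone TTS
}

// A feature is enabled only if the flag is set AND the method exists.
function enabledFeatures(shim: BackendShim): EnabledFeatures {
  return {
    dictate: shim.capabilities.stt && typeof shim.transcribe === "function",
    readBack: shim.capabilities.tts && typeof shim.synthesize === "function",
  };
}

// Made-up future backend that offers STT but no TTS.
const futureBackend: BackendShim = {
  id: "future-backend",
  capabilities: { stt: true, tts: false },
  transcribe: (_audio) => Promise.resolve(""),
};

const features = enabledFeatures(futureBackend);
```

Checking both the flag and the method guards against a shim whose declared capabilities drift from what it actually implements.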

Notes

Deliberately decoupled from the visible per-session change: this lands early and invisibly, while the conversational-button → dictate + read-back swap happens atomically at the chrome-voice switch (#27). That ordering means there's never an interim period with two ways to talk inside a session.

Labels

  • architecture: Architectural / substrate work
  • fleet-overseer: Fleet attention-arbitration architecture
  • voice: Voice conversation / transport surface
