Open
Alex-Wengg
approved these changes
Mar 30, 2026
Author
@Alex-Wengg the merge checks seem to pass but always stop on "test-tts" - I'm struggling to find more information about what that is or why it's not completing. Any idea?
Member
@danielrothmann that's fine, we can ignore test-tts
This PR implements a session API for PocketTTS. Closes #465
The goal was to improve the reliability of long-running sessions with streaming text input. Previously, each call to `synthesizeStreaming()` paid the full voice prefill cost (~125 sequential CoreML predictions) and reset the Mimi decoder state, causing latency and audio discontinuity between utterances.

`PocketTtsSession` is a new actor that performs voice prefill once at creation, then accepts streamed text via `enqueue()`. Each utterance only pays the text prefill cost. Mimi decoder state persists across utterances for audio continuity.

Cancellation is awaitable: `await session.cancel()` does not return until the generation task has fully stopped and the Neural Engine is free, preventing multiple inference loops from stacking up. If the consumer drops the `frames` stream, generation is cancelled automatically.

`AudioFrame` now includes an `utteranceIndex` field for text synchronisation on the consumer side.
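
For reviewers who want a feel for the shape of the pattern, here is a minimal, self-contained sketch of a session actor with one-time setup, streamed text input, awaitable cancellation, and per-frame utterance indices. It is not the code in this PR: `SketchSession`, `SketchFrame`, `SketchUtterance`, and `demo()` are hypothetical names, there is no CoreML or Mimi work in it, and each word simply becomes one fake frame. It assumes Swift 5.9 or later for `AsyncStream.makeStream()`.

```swift
// Hypothetical stand-ins; none of these types come from the PR itself.
struct SketchFrame: Sendable {
    let utteranceIndex: Int   // lets the consumer map audio back to its text
    let samples: [Float]      // placeholder for real audio samples
}

struct SketchUtterance: Sendable {
    let index: Int
    let text: String
}

actor SketchSession {
    nonisolated let frames: AsyncStream<SketchFrame>
    private let textInput: AsyncStream<SketchUtterance>.Continuation
    private let generation: Task<Void, Never>
    private var nextIndex = 0

    init() {
        let (frames, frameOut) = AsyncStream<SketchFrame>.makeStream()
        let (texts, textIn) = AsyncStream<SketchUtterance>.makeStream()

        let task: Task<Void, Never> = Task {
            // One-time setup (the voice prefill in the real session) would go here.
            for await utterance in texts {
                // Per-utterance work only; one fake frame per word.
                for word in utterance.text.split(separator: " ") {
                    if Task.isCancelled { break }
                    frameOut.yield(SketchFrame(utteranceIndex: utterance.index,
                                               samples: [Float(word.count)]))
                }
            }
            frameOut.finish()
        }

        // If the output stream terminates, stop generating.
        frameOut.onTermination = { _ in
            task.cancel()
            textIn.finish()
        }

        self.frames = frames
        self.textInput = textIn
        self.generation = task
    }

    /// Queues text and returns the index that its frames will carry.
    @discardableResult
    func enqueue(_ text: String) -> Int {
        let index = nextIndex
        nextIndex += 1
        textInput.yield(SketchUtterance(index: index, text: text))
        return index
    }

    /// Returns only after the generation task has fully stopped.
    func cancel() async {
        generation.cancel()
        textInput.finish()
        _ = await generation.value
    }
}

// Usage: two utterances, five words in total, so five frames are read.
func demo() async -> [Int] {
    let session = SketchSession()
    await session.enqueue("hello world")
    await session.enqueue("second utterance here")

    var indices: [Int] = []
    for await frame in session.frames {
        indices.append(frame.utteranceIndex)
        if indices.count == 5 { break }
    }
    await session.cancel()
    return indices
}
```

Awaiting the task's `value` inside `cancel()` is what makes cancellation awaitable in this sketch: the caller resumes only after the loop has exited, so a new loop cannot start while the old one is still running.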