
Fix/elevenlabs multi stream input #25

Open
Jacob-Lasky wants to merge 25 commits into main from fix/elevenlabs-multi-stream-input

Jacob-Lasky commented Mar 2, 2026

Switches the ElevenLabs WebSocket URL from stream-input to multi-stream-input. The old endpoint doesn't support barge-in and causes FailedToSpeak after turn 1: Deepgram connects once, ElevenLabs closes the socket after the greeting, and Deepgram never reconnects. Also removes two fields from the provider block: voice_id (Deepgram rejects it when a custom endpoint is set) and language_code (not supported by eleven_turbo_v2_5).
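
For illustration, here is a minimal sketch of the speak block this description implies. It is an assumption, not the code in this PR: the helper name `build_elevenlabs_speak`, the base URL constant, and the exact key names (`type`, `model_id`, `endpoint`, `url`, `headers`) are illustrative. The parts taken from the PR are that the voice ID lives in the URL path, the API key is sent as an `xi-api-key` header, and `voice_id` and `language_code` are left out of the provider object.

```python
from urllib.parse import quote

# Assumed base URL for the ElevenLabs text-to-speech WebSocket API.
ELEVENLABS_WS_BASE = "wss://api.elevenlabs.io/v1/text-to-speech"


def build_elevenlabs_speak(voice_id: str, model_id: str, api_key: str) -> dict:
    """Build a speak block that routes TTS through multi-stream-input.

    Hypothetical helper: the dictionary layout is an assumption about the
    settings schema, not a copy of agent_templates.py.
    """
    # The voice ID goes in the URL path, not in the provider object.
    url = f"{ELEVENLABS_WS_BASE}/{quote(voice_id, safe='')}/multi-stream-input"
    return {
        "provider": {
            "type": "eleven_labs",
            "model_id": model_id,
            # No voice_id and no language_code here, per the PR description.
        },
        "endpoint": {
            "url": url,
            "headers": {"xi-api-key": api_key},
        },
    }
```

Keeping the URL construction in one helper means a future endpoint change touches a single line.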


Note

Medium Risk
Moderate risk: introduces filesystem-backed config CRUD, rewrites agent settings construction (including ElevenLabs endpoint wiring), and adjusts SocketIO/threaded asyncio startup, all of which can affect session startup and runtime behavior.

Overview
Moves demo configuration to JSON and exposes config CRUD. AgentTemplates is rewritten to load configs/*.json, build agent Settings dynamically (sorted/filtered configs, default/disabled support), and client.py adds GET/POST/DELETE /configs for managing configs on disk.

Session startup now uses config-driven defaults and adds hotword support. The start_voice_agent SocketIO handler accepts config_id (fallback to legacy industry), loads the selected config to default voiceModel/voiceName/language, and tweaks SocketIO/threaded asyncio setup (CORS enabled; dedicated event loop policy) to reduce eventlet loop conflicts. Hotword detection is added via new functions in common/agent_functions.py and is injected into the agent prompt/functions list when a config specifies hotword.

Frontend is redesigned and integrated with the new config API. templates/index.html switches to Deepgram design system layout, loads configs from /configs into selectable cards, adds a slide-in builder form that POSTs to /configs, and updates Start Session to emit config_id plus browser-audio capture/playback. static/style.css is reduced to minimal design-system overrides.

Deployment/config hygiene updates. Adds multiple demo JSON files (including ttsProvider: eleven_labs config fields), updates fly.toml with concurrency and health checks, and expands .gitignore/.dockerignore to ignore .claude/ and .planning/ while keeping fly.toml in the Docker context.

Written by Cursor Bugbot for commit ea09850. This will update automatically on new commits.

Jake Lasky and others added 25 commits February 26, 2026 09:53
…Manny stub)

- configs/dubai-real-estate.json: Sophia luxury concierge, Emirates Premium Properties, aura-2-amalthea-en
- configs/hey-saga.json: Saga smart city assistant with Hey Saga hotword, aura-2-arcas-en
- configs/deepgram.json: Deepgram tech support, system prompt sourced from prompt_templates.py
- configs/hey-manny.json: Filipino BPO champion, en-PH, phone_ui mode, Hey Manny hotword
- configs/bpo-tagalog.json: Luna Tagalog BPO agent, language tl
- Mark Phase 1 as Complete in STATE.md
- Add 01-SUMMARY.md with verification results and decisions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace static/style.css with minimal overrides; Deepgram design system loaded via CDN
- Rewrite templates/index.html with dg-columns 3-panel layout (sidebar | conversation | event log)
- Demo selector renders dg-card--selectable cards populated from GET /configs
- Builder slide-in panel with full form: name, company, personality, system prompt, greeting, language, voice model, hotword, mode, function toggles
- Builder POSTs to /configs and refreshes card grid without page reload; edit pre-populates form
- Start/stop button uses dg-btn--primary with mic icon; dg-status component shows connection state
- Force dark mode via :root { color-scheme: dark; }; brand green #13ef95 used for agent messages
- All existing SocketIO audio logic preserved verbatim: audio capture, resampling, playback, event handlers
- Font Awesome icons loaded from CDN; vanilla JS only, no frameworks
…d session start

- handle_start_voice_agent now accepts config_id (new) or industry (legacy) with fallback
- Loads full JSON config via AgentTemplates.load(config_id) for voice model/language defaults
- Frontend sends config_id alongside industry for backward compat
- Frontend guards: requires selectedConfig before starting session
- Wire config_id from frontend through SocketIO handler to VoiceAgent
- Load full JSON config in handler for voice model/language defaults
- Frontend guards Start Session when no config selected
- All 5 demo configs verified via smoke test
- No audio/WebSocket processing logic modified
- Phase 4 status updated to Complete in STATE.md
Chrome's autoplay policy suspends AudioContext created outside a user
gesture (e.g. inside a SocketIO callback). Move audioOutputContext
creation into startAudioCapture() which runs from the Start button
click, ensuring it's always created within a trusted user event.

Firefox was unaffected due to more lenient autoplay handling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With 2 Fly machines, SocketIO was falling back to HTTP polling when
requests hit different machines (Invalid session errors). Binary audio
chunks can't be encoded in polling payloads, causing:
  TypeError: can only concatenate str (not "bytes") to str

Fix: force transports: ['websocket'] on client, allow_upgrades=False
on server. WebSocket connections are long-lived and stick to one
machine, eliminating the cross-machine session issue entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- deepgram.json marked default:true, sorts first via load_all()
- All English configs language fixed: en-US/en-PH -> en (matches TTS API)
- Builder language select now dynamically populated from TTS models
- Builder language change listener repopulates voice models correctly
- openBuilder() repopulates voices for config language before setting value
- Auto-select first config on page load

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace {{agentName}} in systemPrompt/greeting with voice name at init
- Use DefaultEventLoopPolicy().new_event_loop() to avoid eventlet conflict

NOTE: stop/restart thread tracking fix is incomplete - see next commit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
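
The `{{agentName}}` substitution in the commit above can be sketched as follows. This is a sketch under assumptions: both function names are hypothetical, and the `aura-<version>-<name>-<language>` pattern is inferred from model IDs such as `aura-2-thalia-en` that appear in this PR.

```python
PLACEHOLDER = "{{agentName}}"


def voice_name_from_model(voice_model: str) -> str:
    """Derive a display name from a model ID such as 'aura-2-thalia-en'.

    Assumes the pattern aura-<version>-<name>-<language>; anything else is
    returned unchanged.
    """
    parts = voice_model.split("-")
    if len(parts) >= 4 and parts[0] == "aura":
        return parts[2].capitalize()
    return voice_model


def fill_agent_name(text: str, voice_name: str) -> str:
    """Replace every {{agentName}} placeholder with the given voice name."""
    return text.replace(PLACEHOLDER, voice_name)
```

Applied to both `systemPrompt` and `greeting` at init, this keeps the spoken introduction consistent with the selected voice.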
- On stop: close the Deepgram WebSocket before clearing voice_agent,
  and use call_soon_threadsafe for task cancellation (thread-safe)
- On start: track thread in voice_agent_thread global, join previous
  thread (2s timeout) before starting new one to prevent loop conflicts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
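
The stop/restart fix above can be sketched as follows. The `AgentRunner` class is hypothetical (the PR tracks the thread in a `voice_agent_thread` global), and the sketch uses `asyncio.new_event_loop()` rather than the `DefaultEventLoopPolicy().new_event_loop()` call mentioned earlier. What it illustrates is the pattern the commit describes: cancel the task via `call_soon_threadsafe`, then join the old thread with a 2-second timeout before starting a new one.

```python
import asyncio
import threading
from typing import Optional


class AgentRunner:
    """Runs one coroutine on a dedicated event loop in a background thread."""

    def __init__(self) -> None:
        self.loop: Optional[asyncio.AbstractEventLoop] = None
        self.task: Optional[asyncio.Task] = None
        self.thread: Optional[threading.Thread] = None
        self._ready = threading.Event()

    def start(self, coro_factory) -> None:
        # Stop and join any previous thread so two loops never overlap.
        if self.thread is not None and self.thread.is_alive():
            self.stop()
        self._ready.clear()
        self.thread = threading.Thread(
            target=self._run, args=(coro_factory,), daemon=True
        )
        self.thread.start()
        self._ready.wait(timeout=2)

    def _run(self, coro_factory) -> None:
        loop = asyncio.new_event_loop()
        self.loop = loop
        self.task = loop.create_task(coro_factory())
        self._ready.set()
        try:
            loop.run_until_complete(self.task)
        except asyncio.CancelledError:
            pass  # Expected when stop() cancels the task.
        finally:
            loop.close()

    def stop(self) -> None:
        loop, task = self.loop, self.task
        if loop is not None and task is not None:
            try:
                # Task.cancel() is not thread-safe; run it on the owning loop.
                loop.call_soon_threadsafe(task.cancel)
            except RuntimeError:
                pass  # The loop has already been closed.
        if self.thread is not None:
            self.thread.join(timeout=2)
```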
Every demo now introduces itself using the selected voice model's name.
Also sets Thalia as the default voice for the Deepgram Tech Support demo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on call

agentName fix: use caller-provided voiceName/voiceModel for {{agentName}}
substitution instead of the config's default voice model. Previously the
config's voiceModel (e.g. Thalia) was always used regardless of selection.

Hotword: when a config has a 'hotword' field, the agent is put into hotword
mode. check_hotword is added to the functions list and the system prompt
instructs the LLM to call it before every response. The Python implementation
checks if the hotword appears in the transcript and returns either
{active: false} (stay silent) or {active: true, query: "..."} (respond to
the extracted query after the hotword).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
STT transcribes 'Hey Saga' as 'Hey, Saga.' with commas and periods.
Use a regex pattern that allows punctuation/whitespace between hotword
words so 'hey saga' matches 'Hey, Saga.' or 'Hey Saga!' etc.

Also clarify function description to pass only the current utterance,
not accumulated conversation history.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
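
A minimal sketch of the punctuation-tolerant match described above. The function bodies are assumptions rather than the code in common/agent_functions.py; only the `check_hotword` name and the `{active, query}` return shape come from the commit messages.

```python
import re


def hotword_pattern(hotword: str) -> "re.Pattern[str]":
    """Compile a case-insensitive pattern for a multi-word hotword."""
    words = [re.escape(word) for word in hotword.split()]
    # \W+ lets commas, periods, and whitespace sit between the hotword words.
    return re.compile(r"\b" + r"\W+".join(words) + r"\b", re.IGNORECASE)


def check_hotword(transcript: str, hotword: str) -> dict:
    """Return {active: False}, or {active: True, query: <text after hotword>}."""
    match = hotword_pattern(hotword).search(transcript)
    if match is None:
        return {"active": False}
    # Drop punctuation that STT inserts right after the hotword.
    query = transcript[match.end():].lstrip(" \t,.!?;:-")
    return {"active": True, "query": query}
```

For example, `check_hotword("Hey, Saga. What's the weather?", "hey saga")` yields an active result whose query is the text after the hotword.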
Once the hotword fires, subsequent turns pass through check_hotword
without needing the hotword again. A 30-second inactivity timeout resets
to hotword-only mode. _last_activity_time updates on each turn while
the conversation is active, so as long as the user keeps talking the
session stays open.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The LLM can now call close_hotword_session when the user signals they're
done (thanks, got it, okay, that's all, etc.), resetting _conversation_active
to False so the agent returns to silent hotword-only listening immediately
rather than waiting for the 30-second inactivity timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
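
The session behavior from the two commits above (sticky conversation, 30-second inactivity reset, explicit close) can be sketched as a small state object. The `HotwordSession` class and its injectable `clock` parameter are hypothetical; the attribute names `_conversation_active` and `_last_activity_time` follow the commit messages.

```python
import time


class HotwordSession:
    """Tracks whether the agent may respond without hearing the hotword again."""

    def __init__(self, timeout_s: float = 30.0, clock=time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # Injectable so the timeout can be tested.
        self._conversation_active = False
        self._last_activity_time = 0.0

    def is_open(self) -> bool:
        """Return True while the conversation is active and not timed out."""
        if self._conversation_active:
            idle = self._clock() - self._last_activity_time
            if idle > self._timeout_s:
                self._conversation_active = False
        return self._conversation_active

    def on_turn(self, hotword_heard: bool) -> bool:
        """Record a user turn; return True if the agent should respond."""
        if hotword_heard or self.is_open():
            self._conversation_active = True
            self._last_activity_time = self._clock()
            return True
        return False

    def close(self) -> None:
        """Return to hotword-only listening (the close_hotword_session case)."""
        self._conversation_active = False
```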
- hey-manny.json: add disabled:true (hidden from demo selector, config preserved)
- load_all() now filters configs with disabled:true
- dubai-real-estate.json: default voice changed from Amalthea to Pandora

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
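
A rough sketch of what a `load_all()` with this filtering and ordering might look like. The signature, the `id` key derived from the file name, and the tie-break by ID are assumptions; the PR only states that disabled configs are hidden and the default config sorts first.

```python
import json
from pathlib import Path


def load_all(configs_dir: Path) -> list:
    """Load every *.json config, skip disabled ones, and put defaults first."""
    configs = []
    for path in sorted(configs_dir.glob("*.json")):
        config = json.loads(path.read_text(encoding="utf-8"))
        if config.get("disabled"):
            continue  # Hidden from the demo selector; the file stays on disk.
        # Assumed convention: the config ID is the file name without .json.
        config.setdefault("id", path.stem)
        configs.append(config)
    # False sorts before True, so default configs come first, then by ID.
    configs.sort(key=lambda c: (not c.get("default", False), c["id"]))
    return configs
```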
- agent_templates.py: conditional listen provider adds language field for non-English STT (Nova-3 + language="tl" etc)
- agent_templates.py: conditional speak provider builds ElevenLabs provider when config has ttsProvider="eleven_labs"
- bpo-tagalog.json: add ttsProvider, elevenLabsVoiceId, elevenLabsModel fields

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
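
The conditional listen provider might look roughly like this. The key names (`provider`, `type`, `model`, `language`) and the helper name are assumptions about the settings schema, not a copy of agent_templates.py; the behavior shown is only what the commit states, namely that a language field is added for non-English STT.

```python
def build_listen_provider(config: dict) -> dict:
    """Build the listen block, adding a language only for non-English configs."""
    provider = {"type": "deepgram", "model": "nova-3"}
    language = config.get("language", "en")
    if language != "en":
        # For example language="tl" for the Tagalog BPO demo.
        provider["language"] = language
    return {"provider": provider}
```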
Feminine voice: G1AxVA91PtrWu96MHgTC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
voice_id goes in the endpoint URL path, api_key goes in endpoint
headers as xi-api-key — not inline in the provider object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ElevenLabs uses 'fil' not 'tl' for Tagalog — sending an unrecognized
language_code causes the WebSocket to close unexpectedly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deepgram's internal ElevenLabsSpeakProvider struct has a voice_id field
that must be populated for multi-turn reconnections to work. Without it,
voice_id is None and ElevenLabs returns audio:null on the second turn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vements

- Switch ElevenLabs endpoint from wss://stream-input to wss://multi-stream-input
  (stream-input doesn't support barge-in; multi-stream-input does)
- Remove voice_id from ElevenLabs provider block (Deepgram rejects it when
  a custom endpoint is set)
- Hide language/voice model selects from sidebar (set automatically from config)
- Demo cards now show voice name and TTS provider info
- Capitalize voice name in card display
- Preserve default:true flag when editing configs so default card stays on top
- Fix voiceModel/voiceName not updating after config save (re-select with fresh
  data on loadConfigs so currentVoiceModel/voiceName stay in sync)
- Remove Mode field from builder modal (always voice_agent)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

if voiceModel != "aura-2-thalia-en":
    voice_model = voiceModel
if language != "en":
    config_language = language

Sentinel-based override ignores explicit default voice/language selection

Medium Severity

The override logic uses hardcoded default values ("aura-2-thalia-en" and "en") as sentinels to detect whether the caller explicitly chose a voice model or language. Since handle_start_voice_agent always forwards whatever the frontend sends (which could be exactly these defaults), a user who explicitly selects "aura-2-thalia-en" from the voice dropdown for a config that uses a different model (e.g. "aura-2-arcas-en" in hey-saga) will have their choice silently ignored — the config's model is used instead.
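
One possible way to address this, sketched under assumptions: use `None` to mean "the caller did not choose" instead of comparing against the default strings. The function name and signature are hypothetical; the config keys `voiceModel` and `language` follow the names used in this PR.

```python
from typing import Optional, Tuple


def resolve_voice_settings(
    config: dict,
    voice_model: Optional[str] = None,
    language: Optional[str] = None,
) -> Tuple[str, str]:
    """Prefer explicit caller choices; fall back to the config, then defaults."""
    # None means "not provided", so an explicit "aura-2-thalia-en" is honored.
    if voice_model is None:
        voice_model = config.get("voiceModel", "aura-2-thalia-en")
    if language is None:
        language = config.get("language", "en")
    return voice_model, language
```

With this shape, the handler would pass `None` when the frontend sends no value, rather than substituting a default before the call.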


if path.exists():
    path.unlink()
    return True
return False

Path traversal in config CRUD allows arbitrary file access

High Severity

The save, load, and delete static methods construct file paths using unsanitized config_id values (e.g., CONFIGS_DIR / f"{config_id}.json"). A config_id containing ../ can escape the configs/ directory. Since POST /configs and DELETE /configs/<config_id> are unauthenticated Flask routes on a publicly deployed Fly.io app, an attacker can write arbitrary JSON files or delete files anywhere the process has permissions.
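
A minimal sketch of one way to close this hole, assuming config IDs only need letters, digits, hyphens, and underscores. The helper name `safe_config_path` and the allow-list pattern are assumptions, not part of the PR; `save`, `load`, and `delete` would each call it instead of building the path inline. It does not address the missing authentication on the routes.

```python
import re
from pathlib import Path

# Assumed allow-list: 1 to 64 characters, no dots or path separators.
CONFIG_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,63}")


def safe_config_path(configs_dir: Path, config_id: str) -> Path:
    """Return the JSON path for config_id, or raise ValueError if unsafe."""
    if not isinstance(config_id, str) or CONFIG_ID_RE.fullmatch(config_id) is None:
        raise ValueError(f"invalid config_id: {config_id!r}")
    base = configs_dir.resolve()
    path = (base / f"{config_id}.json").resolve()
    # Defense in depth: the resolved file must sit directly inside configs/.
    if path.parent != base:
        raise ValueError(f"config_id escapes the configs directory: {config_id!r}")
    return path
```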

document.getElementById('builder-form').addEventListener('submit', async (e) => {
  e.preventDefault();
  const form = e.target;
  const data = Object.fromEntries(new FormData(form));

FormData flattening corrupts multi-value functions field to string

Medium Severity

The builder form has multiple checkboxes all sharing name="functions", but the submit handler uses Object.fromEntries(new FormData(form)) which silently drops all but the last value for duplicate keys. This saves "functions": "end_call" (a single string) instead of the expected array, corrupting the config's functions field on every save via the builder form.
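
On the client, the usual remedy is to read the checkboxes with `FormData.getAll('functions')` rather than relying on `Object.fromEntries`. As a defensive complement, the server could also normalize the field before saving. The sketch below is an assumption, not code from this PR, and the helper name `normalize_functions` is hypothetical.

```python
def normalize_functions(value) -> list:
    """Coerce the 'functions' field of an incoming config to a list of strings."""
    if value is None or value == "":
        return []  # No checkbox was ticked.
    if isinstance(value, str):
        return [value]  # A flattened single value such as "end_call".
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    raise TypeError(f"unsupported functions value: {value!r}")
```

This repairs configs already saved with a single string, but it cannot recover values the browser dropped, so the client-side fix is still needed.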
