fix(voice): real voice-in by default for hosted ElevenLabs ConvAI#707
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughHosted ElevenLabs voice flows now stream real PCM on each scripted user turn, record audio commits, apply a hard receive timeout ceiling, and add regression coverage for wire frames, shape guards, and hosted live multi-turn scenarios. ChangesElevenLabs real-audio multi-turn
Sequence Diagram(s)sequenceDiagram
participant Scenario as scenario.run()
participant Agent as hosted ElevenLabs agent
participant Judge as judge agent
Scenario->>Agent: start live multi-turn run
loop scripted user turns
Scenario->>Agent: stream real PCM user turn
Agent-->>Scenario: agent audio response
end
Scenario->>Judge: scenario.judge()
Judge-->>Scenario: evaluate SUPPORT_CRITERIA
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (3)
scripts/provision_elevenlabs_agent.py (1)
82-116: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueParameterize the
_turn_configreturn annotation.The helper returns a heterogeneous mapping (
float/str/boolvalues) but is annotated as a baredict. Use a parameterized type so it survives pyright checking and matches the project's typing convention.♻️ Proposed annotation
-def _turn_config() -> dict: +def _turn_config() -> dict[str, float | str | bool]:As per coding guidelines: "Prefer
list[T]anddict[K, V]syntax overList[T]andDict[K, V]" and "Code must pass pyright type checking without errors, preferably usingpyright --strict".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/provision_elevenlabs_agent.py` around lines 82 - 116, The _turn_config helper is annotated with an unparameterized dict even though it returns a mapping of mixed value types; update the return annotation to a parameterized dict type that matches the actual payload and aligns with pyright strict typing. Keep the fix localized to _turn_config in scripts/provision_elevenlabs_agent.py, and use the project’s preferred built-in generic syntax so the annotation accurately describes the float, str, and bool values returned.Source: Coding guidelines
javascript/src/voice/adapters/elevenlabs.ts (1)
378-398: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMove private helpers below public methods.
streamSpeechThenSilence,sendUserMessage, andsendSilenceTailare private helpers but sit before later public APIs likereceiveAudioandonMessage. As per coding guidelines,**/*.tsclasses should place public methods first and private methods at the bottom.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@javascript/src/voice/adapters/elevenlabs.ts` around lines 378 - 398, Move the private helpers in ElevenLabs adapter below the public API methods. In the `ElevenLabs` class, relocate `streamSpeechThenSilence`, `sendUserMessage`, and `sendSilenceTail` so they appear after public methods like `receiveAudio` and `onMessage`, keeping all public methods first and private helpers grouped at the bottom.Source: Coding guidelines
python/scenario/voice/adapters/elevenlabs.py (1)
184-186: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAnnotate the new counter attributes.
These public observability fields should be explicitly typed for pyright/strict-mode compatibility.
Proposed fix
- self.text_commit_count = 0 + self.text_commit_count: int = 0 #: User turns committed by streaming real PCM (audio reached EL's STT). - self.audio_commit_count = 0 + self.audio_commit_count: int = 0As per coding guidelines,
python/**/*.pyrequires explicit type annotations for class attributes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/scenario/voice/adapters/elevenlabs.py` around lines 184 - 186, The new public counter attributes in ElevenLabs adapter are missing explicit class-level type annotations, which breaks pyright/strict-mode expectations. Add explicit types to the observability fields on the ElevenLabs adapter class for text_commit_count and audio_commit_count, keeping them initialized in the same place and matching the existing counter semantics. Use the class definition containing these counters in elevenlabs.py so the attributes are clearly typed and compliant with the python/**/*.py guideline.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@javascript/src/voice/adapters/__tests__/elevenlabs-real-audio.test.ts`:
- Around line 97-112: The JSDoc currently attached to emitUserTranscript
describes the behavior of driveTwoTurns instead, so move that multi-line comment
to driveTwoTurns and give emitUserTranscript a short one-line doc that matches
its role of emitting a user_transcript message. Keep the function names aligned
with their documentation so the test helpers in elevenlabs-real-audio.test.ts
are correctly described.
In `@javascript/src/voice/adapters/elevenlabs.ts`:
- Around line 361-375: The timeout/recovery guidance in ElevenLabs voice
handling needs to distinguish the new "audio" path from the legacy "silence"
fallback. Update the downstream receiveAudio timeout/error messaging in
elevenlabs.ts so that failures from the real-audio stream path point users to
the agent-level turn configuration and "audio"-specific recovery steps, rather
than only suggesting the "silence" to "text" fallback. Use the existing
streamSpeechThenSilence and receiveAudio flow to locate the message and adjust
the guidance text accordingly.
- Around line 384-387: `streamSpeechThenSilence` is incrementing
`audioCommitCount` before confirming there is actual PCM to send, so empty or
text-only chunks can be counted as real speech. Update this method in
`ElevenLabsAdapter` to validate the `data` buffer first and return early for
empty PCM, then only increment `audioCommitCount` and send the websocket payload
after non-empty audio is confirmed. Keep the existing `ws.send` path intact, but
make the commit counter reflect only real audio commits.
In `@python/scenario/voice/adapters/elevenlabs.py`:
- Around line 300-315: In the ElevenLabs adapter, the audio-mode path currently
fails later with a generic recv_audio timed out message, which hides the real
provisioning problem; update the error handling around the audio/silence branch
that calls _stream_speech_then_silence so mode-specific guidance is surfaced
when turn-taking does not behave as expected. Make the failure message
explicitly mention that "audio" requires the agent-level turn_timeout and
turn_eagerness provisioning (the provisioned-agent path) instead of implying a
fallback to the legacy hosted behavior, and keep the messaging distinct from the
"silence" and "text" paths.
- Around line 323-325: The `audio_commit_count` update in `ElevenLabsAdapter` is
counting empty PCM chunks as real commits, so adjust the commit path to skip
incrementing when `data` is empty and only send the empty/silence payload
without treating it as a real audio turn. Update the logic around the
`audio_commit_count` increment and `_ws.send(...)` call in the adapter’s audio
commit flow so observability only reflects actual streamed audio.
---
Nitpick comments:
In `@javascript/src/voice/adapters/elevenlabs.ts`:
- Around line 378-398: Move the private helpers in ElevenLabs adapter below the
public API methods. In the `ElevenLabs` class, relocate
`streamSpeechThenSilence`, `sendUserMessage`, and `sendSilenceTail` so they
appear after public methods like `receiveAudio` and `onMessage`, keeping all
public methods first and private helpers grouped at the bottom.
In `@python/scenario/voice/adapters/elevenlabs.py`:
- Around line 184-186: The new public counter attributes in ElevenLabs adapter
are missing explicit class-level type annotations, which breaks
pyright/strict-mode expectations. Add explicit types to the observability fields
on the ElevenLabs adapter class for text_commit_count and audio_commit_count,
keeping them initialized in the same place and matching the existing counter
semantics. Use the class definition containing these counters in elevenlabs.py
so the attributes are clearly typed and compliant with the python/**/*.py
guideline.
In `@scripts/provision_elevenlabs_agent.py`:
- Around line 82-116: The _turn_config helper is annotated with an
unparameterized dict even though it returns a mapping of mixed value types;
update the return annotation to a parameterized dict type that matches the
actual payload and aligns with pyright strict typing. Keep the fix localized to
_turn_config in scripts/provision_elevenlabs_agent.py, and use the project’s
preferred built-in generic syntax so the annotation accurately describes the
float, str, and bool values returned.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 62c94239-438b-4e36-a911-3d4bf3a23e6a
📒 Files selected for processing (8)
javascript/examples/vitest/tests/voice/elevenlabs-705-real-audio-multiturn.test.tsjavascript/src/voice/adapters/__tests__/elevenlabs-real-audio.test.tsjavascript/src/voice/adapters/elevenlabs.tspython/examples/voice/elevenlabs_hosted.pypython/scenario/voice/adapters/elevenlabs.pypython/tests/voice/test_elevenlabs_hosted_e2e.pypython/tests/voice/test_elevenlabs_turn_commit.pyscripts/provision_elevenlabs_agent.py
58d524e to
5a8c184
Compare
… guards/coverage Addresses /review on #707: - elevenlabs.ts: clarify the keepalive hard-ceiling comment (it is never reset by pings, so it always bounds wall-clock; sizing rationale spelled out) and make the timeout error accurate when the agent is keepalive-pinging (not "silent"). - shape-guard: fix a now-backwards AC8 comment (post-#705 the adapter streams PCM, it does not send user_message); strip /* */ block comments too before the negative user_message assertion (was // only — fragile). - real-audio test: add the silenceTailBytes constructor-validation coverage that was lost when elevenlabs-turn-commit.test.ts was deleted. No behavior change. Design-soundness review verified the silence-tail + ceiling are not reinventing EL SDK built-ins (EL ships no end-of-turn client event). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
javascript/src/voice/adapters/elevenlabs.ts (1)
319-340: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winKeep the private helpers below the public API section.
streamSpeechThenSilence()andsendSilenceTail()now splitsendAudio()from the publicreceiveAudio()/onMessage()methods. Move both helpers below the public methods so the class keeps its public surface grouped together.As per coding guidelines,
**/*.ts: In TypeScript classes, place public methods first, private methods at the bottom, and group related methods together.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@javascript/src/voice/adapters/elevenlabs.ts` around lines 319 - 340, The class in elevenlabs.ts has private helpers grouped above the public API, which breaks the TypeScript method ordering convention. Move streamSpeechThenSilence() and sendSilenceTail() below the public methods such as receiveAudio() and onMessage(), keeping the public surface together and the private helper methods at the bottom while preserving their current behavior and call sites.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@javascript/examples/vitest/tests/voice/elevenlabs-705-real-audio-multiturn.test.ts`:
- Around line 52-72: The helper in assertRealVoiceMultiTurn is too permissive:
it only logs result.success, assumes a fixed minimum audioCommitCount, and
checks only the latest lastUserTranscript. Update it to assert the judged run
actually succeeded, make the expected scripted user-turn count explicit per
pattern instead of hardcoding >=2, and extend the ElevenLabs adapter contract in
voice/adapters/elevenlabs.ts to retain per-turn transcript/commit history so the
test can verify each scripted user turn individually.
In `@javascript/src/voice/__tests__/elevenlabs-hosted-shape.guard.test.ts`:
- Around line 129-136: The shape guard test is only stripping line comments
before asserting on user_message, so block/JSDoc comments in elevenlabs.ts can
still cause false failures. Update the test in
elevenlabs-hosted-shape.guard.test.ts to remove block comments as well before
the user_message check, using the existing ADAPTER_FILE and codeOnly logic, so
the assertion reflects executable code only.
In `@javascript/src/voice/adapters/elevenlabs.ts`:
- Around line 353-357: The hard wall-clock limit in elevenlabs.ts is still tied
to the caller-provided timeout via hardCeilingMs, which allows pinging sockets
to exceed the intended 45-second ceiling. Update the timeout calculation in the
keepalive/hard-timer logic so the hard ceiling is always fixed at 45 seconds
(use the existing KEEPALIVE_HARD_CEILING_S constant directly for the absolute
ceiling), while preserving the separate idle deadline behavior driven by
timeout.
---
Nitpick comments:
In `@javascript/src/voice/adapters/elevenlabs.ts`:
- Around line 319-340: The class in elevenlabs.ts has private helpers grouped
above the public API, which breaks the TypeScript method ordering convention.
Move streamSpeechThenSilence() and sendSilenceTail() below the public methods
such as receiveAudio() and onMessage(), keeping the public surface together and
the private helper methods at the bottom while preserving their current behavior
and call sites.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cbc8923b-f923-4a56-b5b4-63d30c0b5559
📒 Files selected for processing (6)
javascript/examples/vitest/tests/voice/elevenlabs-705-real-audio-multiturn.test.tsjavascript/src/voice/__tests__/elevenlabs-hosted-shape.guard.test.tsjavascript/src/voice/__tests__/elevenlabs-turn-commit.test.tsjavascript/src/voice/adapters/__tests__/elevenlabs-real-audio.test.tsjavascript/src/voice/adapters/__tests__/elevenlabs.test.tsjavascript/src/voice/adapters/elevenlabs.ts
💤 Files with no reviewable changes (1)
- javascript/src/voice/tests/elevenlabs-turn-commit.test.ts
Review verdict: READYReady to ship as an incremental improvement with documented limitations. This supersedes the earlier NOT-READY, which applied a "flawless" bar; the applicable bar (maintainer's decision) is ship the verified improvement over the current published state, track the rough edges as follow-ups. Delivered + verified at this HEAD (
|
d075296 to
6ddb953
Compare
Stream the user's real PCM every turn (drop the text-commit path) so the hosted agent's STT/VAD/turn-taking run on turns 2+. A 1.5s trailing silence tail closes scripted turns on a vanilla agent — no agent-side turn config needed. A keepalive hard-ceiling bounds the no-audio ping wait so receiveAudio cannot hang. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6ddb953 to
09142a4
Compare
Pattern 1 replaced the inert proceed(1) step (which resumed a half-consumed turn and ran the judge, never voicing a user turn) with an explicit unscripted scenario.user() + scenario.agent() pair. The unscripted user() makes the simulator generate its own next line via the LLM and voiceify it; the following agent() flushes that generated PCM to ElevenLabs, incrementing audioCommitCount to 3 (beyond the 2 scripted turns) and yielding a fresh STT user_transcript. Assertions now require >=3 commits and a non-empty lastUserTranscript, with messages stating this proves autonomous voice-to-voice. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
f7c8578 to
a02132b
Compare
… bug Documents the pre-fix wire shape -- a user_message text-commit carrying no user_audio_chunk, so EL STT never ran on turns 2+ -- and asserts the current adapter never emits it. Pairs with the existing fix-path guard, which fails when the real-audio path is reverted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… in-turn The "do you offer support on weekends?" phrasing consistently made the hosted EL agent hand off / end its turn (receiveAudio timed out — 0/3 across runs). An in-domain hours question passes 2/2 (audioCommits=3, real STT, judge pass). Confirms the failure was test content, not the transport. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… realtime user
Make hosted ElevenLabs multi-turn voice work through the scenario API, no hacks.
Both verified live (3 consecutive runs each) and via deterministic unit tests.
- proceed() now drives real voiced multi-turn: callAgent voiceifies the
generated user-sim turn (it was broadcast as TEXT, so EL received no audio
and the next agent turn timed out). voiceifyGeneratedUserTurn reuses the same
voiceifyText pipeline as the scripted user("...") path, gated on USER role +
a voice adapter + a voice user-sim. Live: proceed(4) -> 3 user + 4 agent.
- Realtime user via user(): new OpenAIRealtimeAgentAdapter.speakUserTurn speaks
the scripted line VERBATIM via an out-of-band response.create
(conversation:"none"), instead of the default sendText which makes the model
GENERATE a reply (the user then sounds like the agent). Its spoken audio +
transcript feed the agent under test. Live: 3 user + 4 agent, coherent.
Deterministic: 421 keyless tests pass; tsc clean. Python parity is a follow-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…er turn A bare scenario.user() let the gpt-5-mini sim emit a city-less opener; the agent (per its "do not guess the city" prompt) then asked for clarification instead of calling get_current_weather, intermittently failing the hasToolCall assertion and reddening ci-checks (4/4 retries in the CI run). Unrelated to the voice work — the voice diff cannot reach this pure-text path. A deterministic "weather in Barcelona" opener makes the agent call the tool every run; verified 3/3 green locally. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… two stale docs Code review of #707 surfaced one real correctness bug + two docs that contradicted the code: - speakUserTurn (openai-realtime.ts) read this.lastAgentTranscript as its transcript fallback but never reset it; on turn 2+ with audio-but-no-transcript it returned the PREVIOUS turn's line, defeating the verbatim isolation this method exists to provide. Reset to null at the top so the `?? text` fallback reads only a transcript produced by THIS turn. - findRealtimeUserAgent doc (scenario-execution.ts) still named sendText; the executor routes through speakUserTurn now. - openai-realtime-speak-user-turn unit-test header described emitting conversation.item.create; the test asserts it is NOT emitted (the #705 isolation contract — emitting it would make the model answer the line). Typecheck clean; 149 adapter unit tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011L8sGpz4ph1ryu66UgRF9T
Live evidence — proceed() + realtime user, on the fixed HEADCaptured a live hosted-ElevenLabs run on 1.
|
A realtime USER agent (OpenAI Realtime, role=USER) speaks scripted lines
verbatim via `speakUserTurn` (the `user("...")` route). Driven by
proceed()/autonomous generation it routes through `call()` instead — which,
with the realtime session's `turn_detection:null` and no out-of-band
`response.create`, yields no spoken user turn — and the executor would
broadcast that as an empty/text user turn: a silent voice→text substitution
on the user side, exactly what we never ship.
Guard `voiceifyGeneratedUserTurn`: when a voice agent is under test and the
producer is a realtime user agent with NO voiceify channel, throw a clear,
actionable error pointing at the supported path (scripted `user()`, or a
voice user-simulator for autonomous voiced turns) instead of degrading
silently. Scripted realtime-user turns route through `speakUserTurn` and
never reach this method, so they are unaffected.
Keyless test (offline): proceed(1)+realtime-user throws; scripted
user()+realtime-user does not. 379 keyless tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011L8sGpz4ph1ryu66UgRF9T
…ew fixes) Scoped re-review (5 reviewers + 4 personas) found the prior executor-only guard fired too late: for the real OpenAI realtime adapter, proceed() calls `agent.call()` -> defaultVoiceCall FIRST, which (turn_detection:null, no out-of-band response.create) blocks on a `receiveAudio` that times out *before* the executor guard runs — so the clean guidance never surfaced in production. The keyless executor test masked this (its fake `call()` returns instantly). Convergent finding: Uncle Bob, Fowler, and design-soundness. Fix — two layers, fail closed: - PRIMARY: OpenAIRealtimeAgentAdapter.call() rejects at the top when role=USER. A realtime user's scripted turns route through `speakUserTurn` (the executor add+broadcasts WITHOUT call()), so reaching call() with role=USER means autonomous/proceed drive — unsupported. Fires before any network, so the message actually surfaces. Faithful keyless test (no connect/keys needed). - BACKSTOP: the executor guard stays but is simplified to FAIL CLOSED (`isRealtimeUserAgent(producer)` — dropped the `!isVoiceUserSim` term that was fail-OPEN: it would have routed a future realtime+voiceify hybrid into TTS, the exact silent voice->text substitution we never ship). Catches any other realtime-user adapter that does not self-reject. Its test is reframed as a backstop test (fake = a non-self-rejecting adapter). - Shared message const REALTIME_USER_AUTONOMOUS_UNSUPPORTED (domain/agents) so the two sites can't drift; tests assert the const, not a substring. - Added @throws to voiceifyGeneratedUserTurn; trimmed its comment. tsc clean; 381 keyless tests pass (4 guard tests across the two layers). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011L8sGpz4ph1ryu66UgRF9T
Non-blocking polish from the scoped re-review (principles + test reviewers, no Must-Fix): - De-duplicate the guard rationale: the shared "why" lives on REALTIME_USER_AUTONOMOUS_UNSUPPORTED; the two guard sites now carry only their site-specific reason (adapter: reject before defaultVoiceCall times out; executor: backstop for a non-self-rejecting adapter). - Make the executor backstop comment precise: it converts a silent TEXT broadcast into a loud failure, and the order (before the `!isVoiceUserSim` return) is LOAD-BEARING for a hypothetical dual-shape adapter — noted so a future reorder can't silently reintroduce the substitution. - Rename the role=AGENT adapter test to state both halves; replace its try/catch with a `.catch()` one-liner. tsc clean; 381 keyless tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011L8sGpz4ph1ryu66UgRF9T
Review Brief — fix(voice): real voice-in by default for hosted ElevenLabs ConvAI (#707)Attention map, not a verdict. Gating verdict: review-verdict comment — READY. Mode: Targeted Review · Scariest thing: the fixed 1.5 s silence tail decides every turn's end-of-turn — too short for an agent's VAD and the turn never closes, hanging to the 45 s ceiling then failing, with no text fallback left · Skimmable: No 30-second cockpit
Key decisions & consequences
Inspection path (ranked)
Axis map
Claim vs evidence
Questions for the author
Don't spend time on
Blockers
ReferencesPR #707 · issue #705 · gating verdict comment (READY) · live-evidence comment · related: #708 (CI honesty / non-gating live job) · #711 (autonomous realtime-user follow-up) · #638 (old one-exchange ceiling) · #596 (v0.4.14 text-default origin) · #567 (deleted turn-commit tests). Files: |
Ruthless review — verdict: technically merge-ready, one P2 doc fix to fold inReviewed at HEAD
The three-layer fix is well-targeted: real-audio-only EL path (1.5 s tail + 45 s keepalive hard-ceiling), out-of-band verbatim P2 — two stale load-bearing comments contradict the shipped behaviorSame class as commit
This PR removed Fix: reword both to: the bridged AudioChunk's PCM is streamed as Nit (non-blocking)
Residual risks (acknowledged in the PR, not blockers)
Nothing here blocks correctness. Recommend folding the P2 comment fix into this PR before merge, since it's already in human-review. |
|
The real-voice-in multi-turn tests (
CI looks green only because the (Flagged by support after the failures surfaced — supersedes the green checks and the #dev review request.) |
29f1171 to
3811643
Compare
…carry agent transcripts Wire scenario.proceed(N) to drive an autonomous OpenAI Realtime user (role=USER) through a multi-turn voiced conversation against a hosted ElevenLabs ConvAI agent — the faithful #705 fix. Reverses the prior fail-loud guards: the executor now feeds EL's audio into the realtime session and speaks a generative next user turn each proceed() step (fork B, one in-context response.create), returned as the user's audio turn. The type-based realtime-user guard is replaced by the adapter-agnostic USER_TURN_NO_AUDIO_FOR_VOICE_AUT audio-presence invariant (trips on the produced no-audio artifact regardless of producer type). Fixes two defects found by inspecting the live LangWatch MESSAGE_SNAPSHOTs (which diverge from the on-disk recording manifest): - Missing agent-under-test transcript. Hosted EL streams raw PCM with the turn text on `agent_response` (lastAgentTranscript), but the adapter built the agent chunk from PCM only, so every agent turn reached the conversation message (and LangWatch) as audio with NO transcript — only the recording manifest got a lossy STT back-fill. defaultVoiceCall now (a) TURN-SCOPES the adapter's `lastAgentTranscript` — nulls it before a REPLY's drain (a no-incoming greeting, whose transcript is set on connect, is exempt) so a reply that emits audio but no fresh transcript cannot inherit the prior turn's text — and (b) attaches the native turn transcript to the merged chunk when it carries none, onto BOTH the message and the recording segment. A no-audio turn is never labeled with a transcript. (Turn-scoping in the shared path fixes every agent adapter uniformly, incl. OpenAI Realtime as the agent under test, which reaches it via super.call() — a cross-turn bleed an adversarial review caught.) - Doubled user-sim turns. proceed() is USER-led, so a trailing scripted user() opener immediately before proceed() produced two adjacent user turns (the agent ingests only the latest pending audio, dropping the opener). The repro script now drains the agent's reply to the opener before proceed(), so every user turn is heard and answered. The e2e asserts result.messages is clean: no consecutive same-role turns and a transcript on every agent turn. Tests: voice-agent-transcript (transcript wiring, adapter-agnostic, offline), proceed-turn-count (proceed(N) drives exactly N user turns), judge-coherence-criterion (AGENTS_HEARD_EACH_OTHER discrimination, EL-free), no-transcript-bleed regression (a reply with a stale transcript + no fresh one carries none), repro-705 multi-turn e2e (env-gated, live-proven). Unit suite 881 green, tsc + eslint clean both packages. BREAKING CHANGE: the EL adapter's SDK migration removes the public `WebSocketLike` test-seam type (superseded by the SDK's own WebSocketFactory). Niche, but note it for the version bump. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01XgQFps8Vu3nrCaL37bvgYf
3811643 to
d9d5a77
Compare
Adversarial self-review — verified findings + suggested follow-upsA deeper adversarial pass over this branch (three independent reviewers — correctness / hygiene / test-fidelity — plus code-level verification of every major claim). The core design holds up, but the pass surfaced two verified MAJOR defects and two test-suite hollow-green hacks that the earlier review missed. Filing here so they're tracked against the diff. Merge-base 🔴 Verified defects (checked against the code, not just flagged)F1 — every realtime user turn stalls ~15s (drain has no F2 — the fail-closed invariant accepts a transcript-only turn that EL cannot commit Stale comments (ship a fix with F2): 🟠 Verified test-suite hacks (hollow green)Return-to-skip reports PASSED, not skipped. Self-fulfilling STT assertion. 🟡 Narrower edges (flagged by review; not independently reproduced this pass)
⚪ Hygiene (flagged by review)
🟢 What held up (validated by ≥2 reviewers + code read)
Suggested follow-up
Balance: the feature is functionally correct — it drives real multi-turn voice, transcripts reach LangWatch, and the turn count is right. F1 makes it slow (~15s/turn) and F2 makes it fragile under the known #708 flake, and CI doesn't yet guard the headline path — so this is short of "merge-ready" until at least F1/F2 + the test hacks land. |
…ness assertions Exercises the voice API surface end-to-end against a hosted ElevenLabs agent driven by an autonomous realtime user: verbatim + autonomous user turns, time-based barge-in, silence handling, and voiceProceed with interruptions, closed by the coherence judge. Asserts the on-disk recording (full.wav + manifest + per-segment WAVs, every segment transcribed). Inline comments flag the known rough edges (interrupt placement, voiceProceed turn semantics, short-turn segment fidelity).
…proceed() Remove the TypeScript-only voiceProceed + InterruptionConfig from the voice kitchen-sink demo (voiceProceed is being removed from the SDK as a Python-parity cleanup — tracked separately). The autonomous stretch now uses base scenario.proceed(7, logTurn). Wire-verified, no behavior regression: both the old voiceProceed(3) and the new proceed(7) drive ZERO autonomous realtime-USER turns (a pre-existing #705 gap, not introduced here). Inline comments corrected to the wire truth rather than a guessed N-sizing story. Run coherence remains subject to hosted-EL mishearing flakiness (#708). Verified: examples/vitest `tsc --noEmit` green; no voiceProceed/InterruptionConfig tokens remain in the file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Automated low-risk assessment This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.
This PR requires a manual review before merging. |
|
Blessed. 🙏 "Let all things be done decently and in order." — 1 Corinthians 14:40 Code's sound — tsc clean, keyless
Mind the scripted-path asymmetry too (empty realtime audio → silent EL timeout, no fail-loud like the proceed path has). |
What & why
Customer report (#705): hosted ElevenLabs ConvAI multi-turn voice driven by
scenario.proceed()failed — the agent received no real audio on the user's turns (audioCommits=1,receiveAudio timed out). This PR makesproceed(N)drive a speech-native multi-turn conversation end-to-end, and fixes two defects found by inspecting the actual LangWatch data.The fix
1.
proceed(N)drives an autonomous realtime USER against hosted ELOpenAIRealtimeAgentAdapter.call()forrole=USERnow runs a bespoke autonomous turn (_autonomousUserTurn): it feeds the agent-under-test's last audio into the realtime input buffer, fires ONE in-contextresponse.create, and returns the model's spoken next customer line as the user's audio turn (fork B, audio-context — the realtime model hears EL and replies). The prior fail-loud guard is replaced by an adapter-agnostic executor invariant,USER_TURN_NO_AUDIO_FOR_VOICE_AUT: a USER turn produced for a voice agent under test MUST carry audio, or the run fails loud — it trips on the produced no-audio artifact, regardless of producer type (strictly stronger than the old realtime-user type-check).2. Missing agent-under-test transcript (LangWatch showed agent turns with no text)
Hosted EL streams raw PCM with the turn text on its
agent_responseevent (lastAgentTranscript), but the adapter built the agent chunk from PCM only — so every agent turn reached the conversation message (→ the LangWatchMESSAGE_SNAPSHOT) as audio with no transcript. Only the on-disk recording manifest got a lossy STT back-fill, which is why the recording looked fine while the dashboard did not.defaultVoiceCallnow:lastAgentTranscript— nulls it before a reply's drain (a no-incoming greeting, whose transcript is set on connect, is deliberately exempt) so a reply that emits audio but no fresh transcript cannot inherit the previous turn's text (a cross-turn bleed an adversarial review caught on the OpenAI-Realtime-as-agent path, which reaches this viasuper.call()); and3. Doubled user-sim turns
proceed()is USER-led, so a trailing scripteduser()opener immediately beforeproceed()produced two adjacent user turns — the agent ingests only the latest pending audio, so the opener was dropped and the user appeared to speak twice. The repro script now drains the agent's reply to the opener beforeproceed(), so every user turn is heard and answered (clean alternation).How I can prove I was successful
Deterministic (CI-gated) —
ci-checksgreen on HEAD: eslint + tsc clean (both packages); the keylesssrcsuite passes (881 tests), including:voice-agent-transcript— an agent-under-test's turn carries its native transcript onto the message (adapter-agnostic, offline).proceed-turn-count—proceed(N)drives the user simulator exactly N times.proceed()audio-presence invariant tests.Live (non-gating — hosted EL is ambiently flaky, #708) — exercised THIS change on the real wire:
pnpm -F vitest-examples exec vitest run tests/voice/repro-705-proceed-multiturn.test.ts(needsELEVENLABS_API_KEY/ELEVENLABS_AGENT_ID/OPENAI_API_KEY; self-skips otherwise). On HEAD's builtdist, both tests pass:proceed(4): 4 voiced realtime-user turns + 5 agent replies, 9 segments; the e2e assertsresult.messagesis clean — no consecutive same-role turns and a transcript on every agent turn (greeting and replies).proceed(3)+judge: judgesuccess=true—AGENTS_HEARD_EACH_OTHER(graded on STT of the real audio) confirms the agents heard each other.Human verification
cd javascript && pnpm exec vitest run src/— 881 pass, includingvoice-agent-transcript(native transcript on the message), no-transcript-bleed (the turn-scope reset),proceed-turn-count, and the realtime-user audio-presence invariant.cd javascript && pnpm -r lint && pnpm exec tsc --noEmit— clean (this is theci-checksgate).pnpm -F vitest-examples exec vitest run tests/voice/repro-705-proceed-multiturn.test.ts(needsELEVENLABS_API_KEY/ELEVENLABS_AGENT_ID/OPENAI_API_KEY; self-skips otherwise) —proceed(4)4 user + 5 agent turns clean;proceed(3)+judgecoherent. Best-effort / non-gating (voice e2e CI reports success while tests fail (exit code swallowed); hosted-EL live e2e ambiently flaky #708).Scope & known limitations
scenario_executor.py) is a separate follow-up.user("…")for a realtime user is INTENT, not verbatim — the realtime model speaks the line naturally and may rephrase. Exact-word scripting → use a text or TTS user.voice-integrationjob is non-gating (ambient EL no-audio flakiness). This PR is gated by the deterministic keyless suite, not that job.The EL adapter's SDK migration removes the public
WebSocketLiketest-seam type (exported onmain) — it is obsolete, superseded by the SDK's ownWebSocketFactory. Intrinsic to the SDK migration; niche, but a semver consideration.Closes #705. Do not merge — the maintainer merges after human review.
🤖 Generated with Claude Code