
fix(voice): real voice-in by default for hosted ElevenLabs ConvAI #707

Merged
drewdrewthis merged 13 commits into main from fix/705-real-voice-multiturn
Jul 1, 2026

Conversation

@drewdrewthis

@drewdrewthis drewdrewthis commented Jun 25, 2026

Collaborator

What & why

Customer report (#705): hosted ElevenLabs ConvAI multi-turn voice driven by scenario.proceed() failed — the agent received no real audio on the user's turns (audioCommits=1, receiveAudio timed out). This PR makes proceed(N) drive a speech-native multi-turn conversation end-to-end, and fixes two defects found by inspecting the actual LangWatch data.

Direction change (please note): an earlier revision of this branch fail-loud-guarded the autonomous realtime-user path ("not supported yet", tracked as #711). Per the maintainer's direction, this revision reverses that and wires the path in. An earlier approval was against the fail-loud revision — re-review requested on the new direction.

The fix

1. proceed(N) drives an autonomous realtime USER against hosted EL

OpenAIRealtimeAgentAdapter.call() for role=USER now runs a bespoke autonomous turn (_autonomousUserTurn): it feeds the agent-under-test's last audio into the realtime input buffer, fires ONE in-context response.create, and returns the model's spoken next customer line as the user's audio turn (fork B, audio-context — the realtime model hears EL and replies). The prior fail-loud guard is replaced by an adapter-agnostic executor invariant, USER_TURN_NO_AUDIO_FOR_VOICE_AUT: a USER turn produced for a voice agent under test MUST carry audio, or the run fails loud — it trips on the produced no-audio artifact, regardless of producer type (strictly stronger than the old realtime-user type-check).
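
A minimal sketch of the audio-presence invariant described above, assuming a simplified turn shape. The `TurnArtifact` type and `assertUserTurnHasAudio` helper are hypothetical names for illustration; only the `USER_TURN_NO_AUDIO_FOR_VOICE_AUT` code comes from this PR. The point it shows is that the check inspects the produced artifact and never asks which adapter produced it.

```typescript
// Hypothetical shapes for illustration; the real executor types are not shown in this PR.
type Role = "user" | "agent";

interface TurnArtifact {
  role: Role;
  text?: string;
  audio?: Uint8Array; // raw PCM produced for this turn, if any
}

// Trips on the produced artifact only, regardless of the producer type.
function assertUserTurnHasAudio(
  turn: TurnArtifact,
  agentUnderTestIsVoice: boolean
): void {
  // Text-only agents and agent turns are outside the invariant.
  if (!agentUnderTestIsVoice || turn.role !== "user") return;
  if (!turn.audio || turn.audio.length === 0) {
    throw new Error(
      "USER_TURN_NO_AUDIO_FOR_VOICE_AUT: USER turn for a voice agent under test carried no audio"
    );
  }
}
```

Because the guard looks only at `turn.audio`, a text-only turn from any producer fails the same way, which is what makes it stronger than a check on the producer's type.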

2. Missing agent-under-test transcript (LangWatch showed agent turns with no text)

Hosted EL streams raw PCM with the turn text on its agent_response event (lastAgentTranscript), but the adapter built the agent chunk from PCM only — so every agent turn reached the conversation message (→ the LangWatch MESSAGE_SNAPSHOT) as audio with no transcript. Only the on-disk recording manifest got a lossy STT back-fill, which is why the recording looked fine while the dashboard did not. defaultVoiceCall now:

  • turn-scopes the adapter's lastAgentTranscript — nulls it before a reply's drain (a no-incoming greeting, whose transcript is set on connect, is deliberately exempt) so a reply that emits audio but no fresh transcript cannot inherit the previous turn's text (a cross-turn bleed an adversarial review caught on the OpenAI-Realtime-as-agent path, which reaches this via super.call()); and
  • attaches the native turn transcript to the merged chunk when it carries none — onto BOTH the message and the recording segment. A no-audio turn is never labeled with a transcript.
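
The two bullets above can be sketched as follows. This is an illustration under assumptions, not the code in defaultVoiceCall: `TranscriptSource`, `AgentChunk`, `beginReply`, and `attachTranscript` are hypothetical names, and only `lastAgentTranscript` is taken from the PR text.

```typescript
// Hypothetical stand-ins; only lastAgentTranscript is a name taken from the PR text.
interface TranscriptSource {
  lastAgentTranscript: string | null;
}

interface AgentChunk {
  audio: Uint8Array;
  transcript?: string;
}

// Clear the previous turn's text before draining a reply. A greeting with no
// incoming user turn keeps the transcript that was set on connect.
function beginReply(source: TranscriptSource, hasIncomingUserTurn: boolean): void {
  if (hasIncomingUserTurn) source.lastAgentTranscript = null;
}

// Attach the native transcript only when the chunk has audio and no text yet.
function attachTranscript(chunk: AgentChunk, source: TranscriptSource): AgentChunk {
  if (chunk.audio.length === 0) return chunk; // never label a no-audio turn
  if (chunk.transcript !== undefined) return chunk; // keep an existing transcript
  const fresh = source.lastAgentTranscript;
  return fresh === null ? chunk : { ...chunk, transcript: fresh };
}
```

With the reset in place, a reply that emits audio but no fresh transcript ends up with no transcript at all, rather than the previous turn's text.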

3. Doubled user-sim turns

proceed() is USER-led, so a trailing scripted user() opener immediately before proceed() produced two adjacent user turns — the agent ingests only the latest pending audio, so the opener was dropped and the user appeared to speak twice. The repro script now drains the agent's reply to the opener before proceed(), so every user turn is heard and answered (clean alternation).
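
To make the dropped-opener behavior concrete, here is a toy model of an agent that keeps only the latest pending user audio. `PendingAudioSlot` is a hypothetical class for illustration and does not correspond to anything in the adapter.

```typescript
// Hypothetical model of an agent that keeps only the latest pending user audio.
class PendingAudioSlot {
  private pending: string | null = null;
  readonly heard: string[] = [];

  // A new user turn overwrites any turn the agent has not consumed yet.
  userSpeaks(line: string): void {
    this.pending = line;
  }

  // The agent consumes whatever is pending and replies to it.
  agentReplies(): void {
    if (this.pending !== null) {
      this.heard.push(this.pending);
      this.pending = null;
    }
  }
}
```

Calling `userSpeaks` twice before `agentReplies` loses the first line, which matches the doubled-turn symptom; draining the reply between the two calls lets the agent hear both.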

How I can prove I was successful

Deterministic (CI-gated) — ci-checks green on HEAD: eslint + tsc clean (both packages); the keyless src suite passes (881 tests), including:

  • voice-agent-transcript — an agent-under-test's turn carries its native transcript onto the message (adapter-agnostic, offline).
  • no-transcript-bleed — a reply with a stale prior transcript + no fresh one carries NONE (guards the turn-scope reset; the exact bleed the review found).
  • proceed-turn-count: proceed(N) drives the user simulator exactly N times.
  • the realtime-user + proceed() audio-presence invariant tests.

Live (non-gating — hosted EL is ambiently flaky, #708) — exercised THIS change on the real wire: pnpm -F vitest-examples exec vitest run tests/voice/repro-705-proceed-multiturn.test.ts (needs ELEVENLABS_API_KEY/ELEVENLABS_AGENT_ID/OPENAI_API_KEY; self-skips otherwise). On HEAD's built dist, both tests pass:

  • proceed(4): 4 voiced realtime-user turns + 5 agent replies, 9 segments; the e2e asserts result.messages is clean — no consecutive same-role turns and a transcript on every agent turn (greeting and replies).
  • proceed(3)+judge: judge success=true; AGENTS_HEARD_EACH_OTHER (graded on STT of the real audio) confirms the agents heard each other.
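
The "clean result.messages" assertion in the proceed(4) bullet can be sketched like this. The `ConversationMessage` shape and `findConversationProblems` helper are hypothetical; the real message type and the e2e's own assertion code are not shown in this PR description.

```typescript
// Hypothetical message shape; the real result.messages type is not shown here.
interface ConversationMessage {
  role: "user" | "assistant";
  transcript?: string;
}

// Returns human-readable problems; an empty array means the conversation is clean.
function findConversationProblems(messages: ConversationMessage[]): string[] {
  const problems: string[] = [];
  messages.forEach((message, index) => {
    const previous = index > 0 ? messages[index - 1] : undefined;
    // Two adjacent turns with the same role break the alternation.
    if (previous && previous.role === message.role) {
      problems.push(`turns ${index - 1} and ${index} are both ${message.role}`);
    }
    // Every agent turn, greeting included, must carry a transcript.
    if (message.role === "assistant" && !message.transcript) {
      problems.push(`agent turn ${index} has no transcript`);
    }
  });
  return problems;
}
```

Returning a list of problems instead of a boolean keeps the failure message useful when a live run goes wrong.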

Human verification

  1. Unit (no creds): cd javascript && pnpm exec vitest run src/ — 881 pass, including voice-agent-transcript (native transcript on the message), no-transcript-bleed (the turn-scope reset), proceed-turn-count, and the realtime-user audio-presence invariant.
  2. Lint + types: cd javascript && pnpm -r lint && pnpm exec tsc --noEmit — clean (this is the ci-checks gate).
  3. Live (creds): pnpm -F vitest-examples exec vitest run tests/voice/repro-705-proceed-multiturn.test.ts (needs ELEVENLABS_API_KEY/ELEVENLABS_AGENT_ID/OPENAI_API_KEY; self-skips otherwise) — proceed(4) 4 user + 5 agent turns clean; proceed(3)+judge coherent. Best-effort / non-gating (voice e2e CI reports success while tests fail (exit code swallowed); hosted-EL live e2e ambiently flaky #708).

Scope & known limitations

  • TypeScript only. Python parity (the same executor voiceify gap in scenario_executor.py) is a separate follow-up.
  • user("…") for a realtime user is INTENT, not verbatim — the realtime model speaks the line naturally and may rephrase. Exact-word scripting → use a text or TTS user.
  • CI honesty (#708, "voice e2e CI reports success while tests fail (exit code swallowed); hosted-EL live e2e ambiently flaky"): the live voice-integration job is non-gating because of ambient EL no-audio flakiness. This PR is gated by the deterministic keyless suite, not that job.

⚠️ BREAKING CHANGE (for the version bump)

The EL adapter's SDK migration removes the public WebSocketLike test-seam type (exported on main) — it is obsolete, superseded by the SDK's own WebSocketFactory. Intrinsic to the SDK migration; niche, but a semver consideration.

Closes #705. Do not merge — the maintainer merges after human review.

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 25, 2026



Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Hosted ElevenLabs voice flows now stream real PCM on each scripted user turn, record audio commits, apply a hard receive timeout ceiling, and add regression coverage for wire frames, shape guards, and hosted live multi-turn scenarios.

Changes

ElevenLabs real-audio multi-turn

| Layer / File(s) | Summary |
| --- | --- |
| PCM commit path: `javascript/src/voice/adapters/elevenlabs.ts`, `javascript/src/voice/__tests__/elevenlabs-hosted-shape.guard.test.ts` | The hosted adapter removes turnCommitMode, adds audioCommitCount, streams user_audio_chunk plus a silence tail, and the shape guard checks for user_audio_chunk and the absence of executable user_message. |
| Receive timeout ceiling: `javascript/src/voice/adapters/elevenlabs.ts`, `javascript/src/voice/adapters/__tests__/elevenlabs.test.ts` | receiveAudio now uses a sliding idle deadline plus an absolute hard ceiling, and the keepalive tests assert the extra timer and a rejection after continuous pings. |
| Adapter wire regression: `javascript/src/voice/adapters/__tests__/elevenlabs-real-audio.test.ts` | A fake websocket harness drives two turns, captures turn-2 outbound frames, injects user_transcript, and asserts real PCM user_audio_chunk frames, no user_message frames, audioCommitCount, and lastUserTranscript. |
| Hosted live multi-turn suite: `javascript/examples/vitest/tests/voice/elevenlabs-705-real-audio-multiturn.test.ts` | The new live Vitest suite gates on hosted credentials, creates a fresh hosted ElevenLabs agent, runs three multi-turn patterns, and validates the run with commit, transcript, and judge criteria. |
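
The "sliding idle deadline plus an absolute hard ceiling" from the receive-timeout row can be sketched without real timers. `ReceiveDeadline` is a hypothetical helper, not the adapter's implementation, and the 10 s idle value in the usage note is an assumption; the 45 s ceiling is the figure mentioned in the review of this PR.

```typescript
// Hypothetical helper: tracks when a receive loop should give up.
// Times are plain millisecond numbers so the logic can be tested without timers.
class ReceiveDeadline {
  private idleDeadlineMs: number;
  private readonly hardDeadlineMs: number;
  private readonly idleTimeoutMs: number;

  constructor(startMs: number, idleTimeoutMs: number, hardCeilingMs: number) {
    this.idleTimeoutMs = idleTimeoutMs;
    this.idleDeadlineMs = startMs + idleTimeoutMs;
    this.hardDeadlineMs = startMs + hardCeilingMs;
  }

  // Any inbound activity (audio or a keepalive ping) slides the idle deadline.
  // The hard deadline is never moved.
  onActivity(nowMs: number): void {
    this.idleDeadlineMs = nowMs + this.idleTimeoutMs;
  }

  // Returns why the wait should end, or null to keep waiting.
  expired(nowMs: number): "idle" | "hard-ceiling" | null {
    if (nowMs >= this.hardDeadlineMs) return "hard-ceiling";
    if (nowMs >= this.idleDeadlineMs) return "idle";
    return null;
  }
}
```

With `new ReceiveDeadline(0, 10000, 45000)` and a ping every 5 s, the idle deadline keeps sliding, yet `expired(45000)` still reports `"hard-ceiling"`. That is the case the keepalive tests cover: continuous pings with no audio must not let the wait run forever.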

Sequence Diagram(s)

sequenceDiagram
  participant Scenario as scenario.run()
  participant Agent as hosted ElevenLabs agent
  participant Judge as judge agent

  Scenario->>Agent: start live multi-turn run
  loop scripted user turns
    Scenario->>Agent: stream real PCM user turn
    Agent-->>Scenario: agent audio response
  end
  Scenario->>Judge: scenario.judge()
  Judge-->>Scenario: evaluate SUPPORT_CRITERIA

Possibly related PRs

  • langwatch/scenario#596: Introduces the earlier ElevenLabs turn-commit modes that this PR removes from the hosted adapter.
  • langwatch/scenario#668: Changes the ElevenLabs receive timeout path with timer resets, which this PR extends with a hard ceiling.

Suggested labels

ai-reviewed, prove-it-clean

Suggested reviewers

  • rogeriochaves
  • Aryansharma28
  • sergioestebance

Poem

Hop hop, I streamed my voice through the wire,
no text-crumbs now to mislead the choir.
A silence tail, then pings kept time,
and my judge-bunny grin stayed bright in rhyme.
🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 67.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | The changes add the required examples-as-e2e real-voice multi-turn coverage and remove text commits. |
| Out of Scope Changes check | ✅ Passed | The diff stays focused on the ElevenLabs adapter and tests, with no obvious unrelated changes. |
| Title check | ✅ Passed | The title clearly summarizes the main change: hosted ElevenLabs now defaults to real audio input instead of text commits. |
| Description check | ✅ Passed | The description is clearly about the ElevenLabs real-audio multi-turn and keepalive changes in this patch. |

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (3)
scripts/provision_elevenlabs_agent.py (1)

82-116: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Parameterize the _turn_config return annotation.

The helper returns a heterogeneous mapping (float/str/bool values) but is annotated as a bare dict. Use a parameterized type so it survives pyright checking and matches the project's typing convention.

♻️ Proposed annotation
-def _turn_config() -> dict:
+def _turn_config() -> dict[str, float | str | bool]:

As per coding guidelines: "Prefer list[T] and dict[K, V] syntax over List[T] and Dict[K, V]" and "Code must pass pyright type checking without errors, preferably using pyright --strict".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/provision_elevenlabs_agent.py` around lines 82 - 116, The
_turn_config helper is annotated with an unparameterized dict even though it
returns a mapping of mixed value types; update the return annotation to a
parameterized dict type that matches the actual payload and aligns with pyright
strict typing. Keep the fix localized to _turn_config in
scripts/provision_elevenlabs_agent.py, and use the project’s preferred built-in
generic syntax so the annotation accurately describes the float, str, and bool
values returned.

Source: Coding guidelines

javascript/src/voice/adapters/elevenlabs.ts (1)

378-398: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Move private helpers below public methods.

streamSpeechThenSilence, sendUserMessage, and sendSilenceTail are private helpers but sit before later public APIs like receiveAudio and onMessage. As per coding guidelines, **/*.ts classes should place public methods first and private methods at the bottom.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@javascript/src/voice/adapters/elevenlabs.ts` around lines 378 - 398, Move the
private helpers in ElevenLabs adapter below the public API methods. In the
`ElevenLabs` class, relocate `streamSpeechThenSilence`, `sendUserMessage`, and
`sendSilenceTail` so they appear after public methods like `receiveAudio` and
`onMessage`, keeping all public methods first and private helpers grouped at the
bottom.

Source: Coding guidelines

python/scenario/voice/adapters/elevenlabs.py (1)

184-186: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Annotate the new counter attributes.

These public observability fields should be explicitly typed for pyright/strict-mode compatibility.

Proposed fix
-        self.text_commit_count = 0
+        self.text_commit_count: int = 0
         #: User turns committed by streaming real PCM (audio reached EL's STT).
-        self.audio_commit_count = 0
+        self.audio_commit_count: int = 0

As per coding guidelines, python/**/*.py requires explicit type annotations for class attributes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/scenario/voice/adapters/elevenlabs.py` around lines 184 - 186, The new
public counter attributes in ElevenLabs adapter are missing explicit class-level
type annotations, which breaks pyright/strict-mode expectations. Add explicit
types to the observability fields on the ElevenLabs adapter class for
text_commit_count and audio_commit_count, keeping them initialized in the same
place and matching the existing counter semantics. Use the class definition
containing these counters in elevenlabs.py so the attributes are clearly typed
and compliant with the python/**/*.py guideline.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@javascript/src/voice/adapters/__tests__/elevenlabs-real-audio.test.ts`:
- Around line 97-112: The JSDoc currently attached to emitUserTranscript
describes the behavior of driveTwoTurns instead, so move that multi-line comment
to driveTwoTurns and give emitUserTranscript a short one-line doc that matches
its role of emitting a user_transcript message. Keep the function names aligned
with their documentation so the test helpers in elevenlabs-real-audio.test.ts
are correctly described.

In `@javascript/src/voice/adapters/elevenlabs.ts`:
- Around line 361-375: The timeout/recovery guidance in ElevenLabs voice
handling needs to distinguish the new "audio" path from the legacy "silence"
fallback. Update the downstream receiveAudio timeout/error messaging in
elevenlabs.ts so that failures from the real-audio stream path point users to
the agent-level turn configuration and "audio"-specific recovery steps, rather
than only suggesting the "silence" to "text" fallback. Use the existing
streamSpeechThenSilence and receiveAudio flow to locate the message and adjust
the guidance text accordingly.
- Around line 384-387: `streamSpeechThenSilence` is incrementing
`audioCommitCount` before confirming there is actual PCM to send, so empty or
text-only chunks can be counted as real speech. Update this method in
`ElevenLabsAdapter` to validate the `data` buffer first and return early for
empty PCM, then only increment `audioCommitCount` and send the websocket payload
after non-empty audio is confirmed. Keep the existing `ws.send` path intact, but
make the commit counter reflect only real audio commits.

In `@python/scenario/voice/adapters/elevenlabs.py`:
- Around line 300-315: In the ElevenLabs adapter, the audio-mode path currently
fails later with a generic recv_audio timed out message, which hides the real
provisioning problem; update the error handling around the audio/silence branch
that calls _stream_speech_then_silence so mode-specific guidance is surfaced
when turn-taking does not behave as expected. Make the failure message
explicitly mention that "audio" requires the agent-level turn_timeout and
turn_eagerness provisioning (the provisioned-agent path) instead of implying a
fallback to the legacy hosted behavior, and keep the messaging distinct from the
"silence" and "text" paths.
- Around line 323-325: The `audio_commit_count` update in `ElevenLabsAdapter` is
counting empty PCM chunks as real commits, so adjust the commit path to skip
incrementing when `data` is empty and only send the empty/silence payload
without treating it as a real audio turn. Update the logic around the
`audio_commit_count` increment and `_ws.send(...)` call in the adapter’s audio
commit flow so observability only reflects actual streamed audio.

---

Nitpick comments:
In `@javascript/src/voice/adapters/elevenlabs.ts`:
- Around line 378-398: Move the private helpers in ElevenLabs adapter below the
public API methods. In the `ElevenLabs` class, relocate
`streamSpeechThenSilence`, `sendUserMessage`, and `sendSilenceTail` so they
appear after public methods like `receiveAudio` and `onMessage`, keeping all
public methods first and private helpers grouped at the bottom.

In `@python/scenario/voice/adapters/elevenlabs.py`:
- Around line 184-186: The new public counter attributes in ElevenLabs adapter
are missing explicit class-level type annotations, which breaks
pyright/strict-mode expectations. Add explicit types to the observability fields
on the ElevenLabs adapter class for text_commit_count and audio_commit_count,
keeping them initialized in the same place and matching the existing counter
semantics. Use the class definition containing these counters in elevenlabs.py
so the attributes are clearly typed and compliant with the python/**/*.py
guideline.

In `@scripts/provision_elevenlabs_agent.py`:
- Around line 82-116: The _turn_config helper is annotated with an
unparameterized dict even though it returns a mapping of mixed value types;
update the return annotation to a parameterized dict type that matches the
actual payload and aligns with pyright strict typing. Keep the fix localized to
_turn_config in scripts/provision_elevenlabs_agent.py, and use the project’s
preferred built-in generic syntax so the annotation accurately describes the
float, str, and bool values returned.
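
For the audioCommitCount finding above (count a commit only after non-empty PCM is confirmed), a minimal sketch might look like this. `TurnStreamer` and the `OutboundFrame` shape are hypothetical; serialization to the ElevenLabs wire format is left out, and the `"silence_tail"` label exists only in this sketch.

```typescript
// Hypothetical frame and sender types; wire serialization is omitted.
type OutboundFrame =
  | { kind: "user_audio_chunk"; pcm: Uint8Array }
  | { kind: "silence_tail"; pcm: Uint8Array };

class TurnStreamer {
  audioCommitCount = 0;
  private readonly send: (frame: OutboundFrame) => void;
  private readonly silenceTailBytes: number;

  constructor(send: (frame: OutboundFrame) => void, silenceTailBytes: number) {
    this.send = send;
    this.silenceTailBytes = silenceTailBytes;
  }

  // Returns true when a real audio turn was committed.
  streamSpeechThenSilence(pcm: Uint8Array): boolean {
    if (pcm.length === 0) return false; // nothing to commit, counter untouched
    this.audioCommitCount += 1;
    this.send({ kind: "user_audio_chunk", pcm });
    // Zero-filled PCM16 is digital silence; it signals the end of the turn.
    this.send({ kind: "silence_tail", pcm: new Uint8Array(this.silenceTailBytes) });
    return true;
  }
}
```

Assuming 16 kHz mono PCM16 (an assumption, the sample rate is not stated here), a 1.5 s tail is 1.5 × 16000 × 2 = 48000 bytes, so `new TurnStreamer(send, 48000)` would match the tail length mentioned in this PR's commit message.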

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 62c94239-438b-4e36-a911-3d4bf3a23e6a

📥 Commits

Reviewing files that changed from the base of the PR and between 0e12e59 and 58d524e.

📒 Files selected for processing (8)
  • javascript/examples/vitest/tests/voice/elevenlabs-705-real-audio-multiturn.test.ts
  • javascript/src/voice/adapters/__tests__/elevenlabs-real-audio.test.ts
  • javascript/src/voice/adapters/elevenlabs.ts
  • python/examples/voice/elevenlabs_hosted.py
  • python/scenario/voice/adapters/elevenlabs.py
  • python/tests/voice/test_elevenlabs_hosted_e2e.py
  • python/tests/voice/test_elevenlabs_turn_commit.py
  • scripts/provision_elevenlabs_agent.py

Comment thread javascript/src/voice/adapters/elevenlabs.ts Outdated
Comment thread javascript/src/voice/adapters/elevenlabs.ts Outdated
Comment thread python/scenario/voice/adapters/elevenlabs.py Outdated
Comment thread python/scenario/voice/adapters/elevenlabs.py Outdated
@drewdrewthis drewdrewthis force-pushed the fix/705-real-voice-multiturn branch from 58d524e to 5a8c184 Compare June 25, 2026 13:29
@drewdrewthis drewdrewthis changed the title spike(voice/#705): real voice-in multi-turn on hosted ElevenLabs — harness + audio mode (live validation pending creds) fix(voice/#705): real voice-in by default for hosted ElevenLabs — drop turnCommitMode + receiveAudio backstop Jun 25, 2026
drewdrewthis pushed a commit that referenced this pull request Jun 25, 2026
… guards/coverage

Addresses /review on #707:
- elevenlabs.ts: clarify the keepalive hard-ceiling comment (it is never reset by
  pings, so it always bounds wall-clock; sizing rationale spelled out) and make
  the timeout error accurate when the agent is keepalive-pinging (not "silent").
- shape-guard: fix a now-backwards AC8 comment (post-#705 the adapter streams PCM,
  it does not send user_message); strip /* */ block comments too before the
  negative user_message assertion (was // only — fragile).
- real-audio test: add the silenceTailBytes constructor-validation coverage that
  was lost when elevenlabs-turn-commit.test.ts was deleted.

No behavior change. Design-soundness review verified the silence-tail + ceiling
are not reinventing EL SDK built-ins (EL ships no end-of-turn client event).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
javascript/src/voice/adapters/elevenlabs.ts (1)

319-340: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Keep the private helpers below the public API section.

streamSpeechThenSilence() and sendSilenceTail() now split sendAudio() from the public receiveAudio() / onMessage() methods. Move both helpers below the public methods so the class keeps its public surface grouped together.

As per coding guidelines, **/*.ts: In TypeScript classes, place public methods first, private methods at the bottom, and group related methods together.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@javascript/src/voice/adapters/elevenlabs.ts` around lines 319 - 340, The
class in elevenlabs.ts has private helpers grouped above the public API, which
breaks the TypeScript method ordering convention. Move streamSpeechThenSilence()
and sendSilenceTail() below the public methods such as receiveAudio() and
onMessage(), keeping the public surface together and the private helper methods
at the bottom while preserving their current behavior and call sites.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@javascript/examples/vitest/tests/voice/elevenlabs-705-real-audio-multiturn.test.ts`:
- Around line 52-72: The helper in assertRealVoiceMultiTurn is too permissive:
it only logs result.success, assumes a fixed minimum audioCommitCount, and
checks only the latest lastUserTranscript. Update it to assert the judged run
actually succeeded, make the expected scripted user-turn count explicit per
pattern instead of hardcoding >=2, and extend the ElevenLabs adapter contract in
voice/adapters/elevenlabs.ts to retain per-turn transcript/commit history so the
test can verify each scripted user turn individually.

In `@javascript/src/voice/__tests__/elevenlabs-hosted-shape.guard.test.ts`:
- Around line 129-136: The shape guard test is only stripping line comments
before asserting on user_message, so block/JSDoc comments in elevenlabs.ts can
still cause false failures. Update the test in
elevenlabs-hosted-shape.guard.test.ts to remove block comments as well before
the user_message check, using the existing ADAPTER_FILE and codeOnly logic, so
the assertion reflects executable code only.

In `@javascript/src/voice/adapters/elevenlabs.ts`:
- Around line 353-357: The hard wall-clock limit in elevenlabs.ts is still tied
to the caller-provided timeout via hardCeilingMs, which allows pinging sockets
to exceed the intended 45-second ceiling. Update the timeout calculation in the
keepalive/hard-timer logic so the hard ceiling is always fixed at 45 seconds
(use the existing KEEPALIVE_HARD_CEILING_S constant directly for the absolute
ceiling), while preserving the separate idle deadline behavior driven by
timeout.

---

Nitpick comments:
In `@javascript/src/voice/adapters/elevenlabs.ts`:
- Around line 319-340: The class in elevenlabs.ts has private helpers grouped
above the public API, which breaks the TypeScript method ordering convention.
Move streamSpeechThenSilence() and sendSilenceTail() below the public methods
such as receiveAudio() and onMessage(), keeping the public surface together and
the private helper methods at the bottom while preserving their current behavior
and call sites.
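
The shape-guard finding above (strip block comments as well as line comments before the negative user_message assertion) can be sketched as below. `stripComments` and `mentionsInExecutableCode` are hypothetical helpers, not the names used in the guard test, and the approach is deliberately naive.

```typescript
// Naive comment stripper for a source-text guard. It does not parse strings,
// so a "//" inside a string literal (a URL, for example) is also stripped.
function stripComments(source: string): string {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, "") // block and JSDoc comments
    .replace(/\/\/.*$/gm, ""); // line comments
}

// True when the token survives comment stripping, i.e. appears in code.
function mentionsInExecutableCode(source: string, token: string): boolean {
  return stripComments(source).indexOf(token) !== -1;
}
```

A text-level guard like this is cheap but fragile by nature. It fits a regression tripwire where a false positive is acceptable; a parser-based check would be the sturdier choice if the guard ever starts misfiring.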

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cbc8923b-f923-4a56-b5b4-63d30c0b5559

📥 Commits

Reviewing files that changed from the base of the PR and between 58d524e and 4968861.

📒 Files selected for processing (6)
  • javascript/examples/vitest/tests/voice/elevenlabs-705-real-audio-multiturn.test.ts
  • javascript/src/voice/__tests__/elevenlabs-hosted-shape.guard.test.ts
  • javascript/src/voice/__tests__/elevenlabs-turn-commit.test.ts
  • javascript/src/voice/adapters/__tests__/elevenlabs-real-audio.test.ts
  • javascript/src/voice/adapters/__tests__/elevenlabs.test.ts
  • javascript/src/voice/adapters/elevenlabs.ts
💤 Files with no reviewable changes (1)
  • javascript/src/voice/__tests__/elevenlabs-turn-commit.test.ts

Comment thread javascript/src/voice/adapters/elevenlabs.ts Outdated
@drewdrewthis

drewdrewthis commented Jun 25, 2026

Collaborator Author

Review verdict: READY

Ready to ship as an incremental improvement with documented limitations. This supersedes the earlier NOT-READY, which applied a "flawless" bar; the applicable bar (maintainer's decision) is ship the verified improvement over the current published state, track the rough edges as follow-ups.

Delivered + verified at this HEAD (24765b4)

  • Real multi-turn voice via proceed() against hosted ElevenLabs, driven by a speech-native realtime user — proven by repro-705-proceed-multiturn (proceed(4) → 4 voiced user + 5 agent turns, clean alternation; proceed(3)+judge coherent) and 881 keyless unit tests green.
  • Two data-verified defect fixes: missing agent (AUT) transcript on the LangWatch snapshot; doubled user-sim turns.
  • voiceProceed removed from the kitchen-sink demo (TypeScript-only parity divergence; SDK-level removal is tracked in #714, "Remove the TypeScript-only voiceProceed script verb (restore Python↔TS parity)").
  • CI green (12 pass / 4 skip / 0 fail); MERGEABLE; write-pr format + proof sections present; all 8 review threads resolved.

Shipped WITH these documented, tracked limitations (non-blocking, maintainer-accepted)

Human-in-the-loop

Assignee set; human review requested. NOTE: the prior human approval (sergioestebance) was against the earlier fail-loud revision; this HEAD reverses that direction (wires the realtime-user path in) — flagged in the PR body as a re-review request.

Merge decision is the maintainer's.

@drewdrewthis drewdrewthis force-pushed the fix/705-real-voice-multiturn branch from d075296 to 6ddb953 Compare June 25, 2026 14:52
Stream the user's real PCM every turn (drop the text-commit path) so the hosted agent's STT/VAD/turn-taking run on turns 2+. A 1.5s trailing silence tail closes scripted turns on a vanilla agent — no agent-side turn config needed. A keepalive hard-ceiling bounds the no-audio ping wait so receiveAudio cannot hang.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@drewdrewthis drewdrewthis force-pushed the fix/705-real-voice-multiturn branch from 6ddb953 to 09142a4 Compare June 25, 2026 15:00
@drewdrewthis drewdrewthis changed the title fix(voice/#705): real voice-in by default for hosted ElevenLabs — drop turnCommitMode + receiveAudio backstop fix(voice): real voice-in by default for hosted ElevenLabs ConvAI Jun 25, 2026
Pattern 1 replaced the inert proceed(1) step (which resumed a half-consumed turn and ran the judge, never voicing a user turn) with an explicit unscripted scenario.user() + scenario.agent() pair. The unscripted user() makes the simulator generate its own next line via the LLM and voiceify it; the following agent() flushes that generated PCM to ElevenLabs, incrementing audioCommitCount to 3 (beyond the 2 scripted turns) and yielding a fresh STT user_transcript. Assertions now require >=3 commits and a non-empty lastUserTranscript, with messages stating this proves autonomous voice-to-voice.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@drewdrewthis drewdrewthis force-pushed the fix/705-real-voice-multiturn branch from f7c8578 to a02132b Compare June 25, 2026 15:41
@drewdrewthis drewdrewthis self-assigned this Jun 25, 2026
Ubuntu and others added 5 commits June 25, 2026 16:26
… bug

Documents the pre-fix wire shape -- a user_message text-commit carrying no
user_audio_chunk, so EL STT never ran on turns 2+ -- and asserts the current
adapter never emits it. Pairs with the existing fix-path guard, which fails
when the real-audio path is reverted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… in-turn

The "do you offer support on weekends?" phrasing consistently made the hosted
EL agent hand off / end its turn (receiveAudio timed out — 0/3 across runs).
An in-domain hours question passes 2/2 (audioCommits=3, real STT, judge pass).
Confirms the failure was test content, not the transport.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… realtime user

Make hosted ElevenLabs multi-turn voice work through the scenario API, no hacks.
Both verified live (3 consecutive runs each) and via deterministic unit tests.

- proceed() now drives real voiced multi-turn: callAgent voiceifies the
  generated user-sim turn (it was broadcast as TEXT, so EL received no audio
  and the next agent turn timed out). voiceifyGeneratedUserTurn reuses the same
  voiceifyText pipeline as the scripted user("...") path, gated on USER role +
  a voice adapter + a voice user-sim. Live: proceed(4) -> 3 user + 4 agent.

- Realtime user via user(): new OpenAIRealtimeAgentAdapter.speakUserTurn speaks
  the scripted line VERBATIM via an out-of-band response.create
  (conversation:"none"), instead of the default sendText which makes the model
  GENERATE a reply (the user then sounds like the agent). Its spoken audio +
  transcript feed the agent under test. Live: 3 user + 4 agent, coherent.
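
For illustration, the out-of-band response.create described in the second bullet could be built roughly like this. The `response.create` event and its `conversation: "none"` field are part of the OpenAI Realtime API; the `buildVerbatimSpeakEvent` helper, its type, and the instruction wording are hypothetical and are not taken from speakUserTurn.

```typescript
// Hypothetical builder for an out-of-band spoken line.
// The instruction wording is an assumed prompt, not the adapter's.
interface OutOfBandResponseCreate {
  type: "response.create";
  response: {
    conversation: "none";
    instructions: string;
  };
}

function buildVerbatimSpeakEvent(line: string): OutOfBandResponseCreate {
  return {
    type: "response.create",
    response: {
      // "none" keeps this response out of the default conversation state.
      conversation: "none",
      // JSON.stringify quotes the line so the model sees clear boundaries.
      instructions: `Say exactly the following, with no additions: ${JSON.stringify(line)}`,
    },
  };
}
```

The event would then be serialized with `JSON.stringify` and sent over the realtime socket. No conversation.item.create is emitted alongside it, in line with the isolation contract described in this PR.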

Deterministic: 421 keyless tests pass; tsc clean. Python parity is a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…er turn

A bare scenario.user() let the gpt-5-mini sim emit a city-less opener; the
agent (per its "do not guess the city" prompt) then asked for clarification
instead of calling get_current_weather, intermittently failing the hasToolCall
assertion and reddening ci-checks (4/4 retries in the CI run). Unrelated to the
voice work — the voice diff cannot reach this pure-text path. A deterministic
"weather in Barcelona" opener makes the agent call the tool every run; verified
3/3 green locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… two stale docs

Code review of #707 surfaced one real correctness bug + two docs that
contradicted the code:

- speakUserTurn (openai-realtime.ts) read this.lastAgentTranscript as its
  transcript fallback but never reset it; on turn 2+ with audio-but-no-transcript
  it returned the PREVIOUS turn's line, defeating the verbatim isolation this
  method exists to provide. Reset to null at the top so the `?? text` fallback
  reads only a transcript produced by THIS turn.
- findRealtimeUserAgent doc (scenario-execution.ts) still named sendText; the
  executor routes through speakUserTurn now.
- openai-realtime-speak-user-turn unit-test header described emitting
  conversation.item.create; the test asserts it is NOT emitted (the #705
  isolation contract — emitting it would make the model answer the line).

Typecheck clean; 149 adapter unit tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011L8sGpz4ph1ryu66UgRF9T
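
The stale-transcript bug in the commit above can be reproduced in a few lines. This is a minimal sketch, not the adapter's code: `SpeakUserTurnSketch`, `recordTranscript`, and the `resetEachTurn` flag are hypothetical names, and the drain is reduced to a single call that may or may not deliver a transcript.

```typescript
// Simplified stand-in for the adapter; only the transcript bookkeeping is modeled.
class SpeakUserTurnSketch {
  private lastAgentTranscript: string | null = null;
  private readonly resetEachTurn: boolean;

  constructor(resetEachTurn: boolean) {
    this.resetEachTurn = resetEachTurn;
  }

  // Models the drain: the model may return audio with or without a transcript.
  private recordTranscript(fromModel: string | null): void {
    if (fromModel !== null) {
      this.lastAgentTranscript = fromModel;
    }
  }

  private currentTranscript(): string | null {
    return this.lastAgentTranscript;
  }

  speakUserTurn(text: string, transcriptFromModel: string | null): string {
    if (this.resetEachTurn) {
      // The fix: clear state so the fallback only sees this turn's transcript.
      this.lastAgentTranscript = null;
    }
    this.recordTranscript(transcriptFromModel);
    // Fall back to the scripted text when this turn produced no transcript.
    return this.currentTranscript() ?? text;
  }
}

// Turn 2 arrives as audio with no transcript (null) in both variants.
const buggy = new SpeakUserTurnSketch(false);
buggy.speakUserTurn("Hi, I have a question.", "Hi, I have a question.");
const staleLine = buggy.speakUserTurn("What are your support hours?", null);

const fixed = new SpeakUserTurnSketch(true);
fixed.speakUserTurn("Hi, I have a question.", "Hi, I have a question.");
const freshLine = fixed.speakUserTurn("What are your support hours?", null);
```

Without the reset, `staleLine` is turn 1's line; with it, `freshLine` falls back to the text scripted for turn 2.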
@drewdrewthis

Collaborator Author

Live evidence — proceed() + realtime user, on the fixed HEAD

Captured a live hosted-ElevenLabs run on 7d0d02d. Test Files 2 passed (2), Tests 3 passed (3), real exit code 0 (the exit is honest — no | tee swallow, unlike the #708 CI step).

1. proceed(4) drives real multi-turn voice — the customer's actual driver

✓ tests/voice/repro-705-proceed-multiturn.test.ts (2 tests) 57229ms
    ✓ proceed(4) drives >=3 voiced user turns + >=3 agent replies (hosted EL, no judge)  35193ms
    ✓ proceed(3) + judge() completes the full 'scenario API as is' shape (hosted EL)     22035ms

proceed(4) streamed ≥3 voiced user turns and got ≥3 agent replies over the live EL WebSocket — the path that produced only audioCommits=1 before the voiceifyGeneratedUserTurn fix (the generated user turn was broadcast as text, so EL's STT never ran on turns 2+).

Note on proceed(N) + judge(): [repro#705 judged] UTTERANCE COUNTS → user audio turns=1, agent replies=2, success=true. When a per-turn judge() yields a verdict it ends the run (inherent, Python-identical) — so the judged shape returns as soon as the judge is satisfied. For a guaranteed N voiced turns, use proceed(N) without a per-turn judge (the first test) or stricter criteria.

2. Realtime (speech-native) USER drives hosted EL — coherent 3 user + 4 agent turns

[#705] segments=7 userTurns=3 agentTurns=4 success=true
[#705]   seg0 agent 1.50s bytes=71808  transcript="Hello, how can I help you today?"
[#705]   seg1 user  2.60s bytes=124800 transcript="Hi, I have a question about my account."
[#705]   seg2 agent 2.31s bytes=110704 transcript="Certainly. What would you like to know about your account?"
[#705]   seg3 user  2.70s bytes=129600 transcript="Thanks — can you tell me your support hours?"
[#705]   seg4 agent 4.61s bytes=221408 transcript="Our support team is available Monday through Friday from 9 AM to 5 PM Eastern Time."
[#705]   seg5 user  2.90s bytes=139200 transcript="Got it. One more — how do I reset my password?"
[#705]   seg6 agent 5.64s bytes=270776 transcript="You can reset your password by visiting our website and clicking the \"Forgot Password\" link on the login page."

A realtime OpenAI user (role=USER) speaks each line verbatim via speakUserTurn's out-of-band response.create (conversation:"none", input:[], output_modalities:["audio"]). The hosted EL agent answers coherently, and real PCM flows both ways (bytes= per segment). Judge success=true. This run exercises the lastAgentTranscript fix across 3 user turns with correct per-turn transcripts (no stale leak).
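
For readers unfamiliar with the out-of-band shape, the request described above might look roughly like the sketch below. The three fields `conversation`, `input`, and `output_modalities` come from this comment; nesting them under `response`, the helper name `buildVerbatimSpeakEvent`, and the wording of `instructions` are assumptions for illustration and should be checked against the Realtime API reference and the adapter itself.

```typescript
// Loose, illustrative typing; the real SDK types are richer.
interface OutOfBandResponseCreate {
  type: string;
  response: {
    conversation: string;
    input: unknown[];
    output_modalities: string[];
    instructions: string;
  };
}

// Hypothetical helper: builds a request that speaks `line` without adding it
// to (or reading from) the default conversation.
function buildVerbatimSpeakEvent(line: string): OutOfBandResponseCreate {
  return {
    type: "response.create",
    response: {
      conversation: "none", // out-of-band: not written to session history
      input: [], // no prior context, so the model cannot "answer" anything
      output_modalities: ["audio"],
      // Assumed prompt wording; the adapter's actual instructions may differ.
      instructions: `Say exactly this line and nothing else: ${JSON.stringify(line)}`,
    },
  };
}

const speakEvent = buildVerbatimSpeakEvent("Hi, I have a question about my account.");
```

Note that no `conversation.item.create` is sent in this sketch, which matches the isolation contract the unit test asserts.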

What gates vs what demonstrates

🤖 Captured with Claude Code

A realtime USER agent (OpenAI Realtime, role=USER) speaks scripted lines
verbatim via `speakUserTurn` (the `user("...")` route). Driven by
proceed()/autonomous generation it routes through `call()` instead — which,
with the realtime session's `turn_detection:null` and no out-of-band
`response.create`, yields no spoken user turn — and the executor would
broadcast that as an empty/text user turn: a silent voice→text substitution
on the user side, exactly what we never ship.

Guard `voiceifyGeneratedUserTurn`: when a voice agent is under test and the
producer is a realtime user agent with NO voiceify channel, throw a clear,
actionable error pointing at the supported path (scripted `user()`, or a
voice user-simulator for autonomous voiced turns) instead of degrading
silently. Scripted realtime-user turns route through `speakUserTurn` and
never reach this method, so they are unaffected.

Keyless test (offline): proceed(1)+realtime-user throws; scripted
user()+realtime-user does not. 379 keyless tests pass; tsc clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011L8sGpz4ph1ryu66UgRF9T
Ubuntu and others added 2 commits June 26, 2026 09:23
…ew fixes)

Scoped re-review (5 reviewers + 4 personas) found the prior executor-only
guard fired too late: for the real OpenAI realtime adapter, proceed() calls
`agent.call()` -> defaultVoiceCall FIRST, which (turn_detection:null, no
out-of-band response.create) blocks on a `receiveAudio` that times out
*before* the executor guard runs — so the clean guidance never surfaced in
production. The keyless executor test masked this (its fake `call()` returns
instantly). Convergent finding: Uncle Bob, Fowler, and design-soundness.

Fix — two layers, fail closed:
- PRIMARY: OpenAIRealtimeAgentAdapter.call() rejects at the top when
  role=USER. A realtime user's scripted turns route through `speakUserTurn`
  (the executor add+broadcasts WITHOUT call()), so reaching call() with
  role=USER means autonomous/proceed drive — unsupported. Fires before any
  network, so the message actually surfaces. Faithful keyless test (no
  connect/keys needed).
- BACKSTOP: the executor guard stays but is simplified to FAIL CLOSED
  (`isRealtimeUserAgent(producer)` — dropped the `!isVoiceUserSim` term that
  was fail-OPEN: it would have routed a future realtime+voiceify hybrid into
  TTS, the exact silent voice->text substitution we never ship). Catches any
  other realtime-user adapter that does not self-reject. Its test is reframed
  as a backstop test (fake = a non-self-rejecting adapter).
- Shared message const REALTIME_USER_AUTONOMOUS_UNSUPPORTED (domain/agents)
  so the two sites can't drift; tests assert the const, not a substring.
- Added @throws to voiceifyGeneratedUserTurn; trimmed its comment.

tsc clean; 381 keyless tests pass (4 guard tests across the two layers).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011L8sGpz4ph1ryu66UgRF9T
Non-blocking polish from the scoped re-review (principles + test reviewers,
no Must-Fix):
- De-duplicate the guard rationale: the shared "why" lives on
  REALTIME_USER_AUTONOMOUS_UNSUPPORTED; the two guard sites now carry only
  their site-specific reason (adapter: reject before defaultVoiceCall times
  out; executor: backstop for a non-self-rejecting adapter).
- Make the executor backstop comment precise: it converts a silent TEXT
  broadcast into a loud failure, and the order (before the `!isVoiceUserSim`
  return) is LOAD-BEARING for a hypothetical dual-shape adapter — noted so a
  future reorder can't silently reintroduce the substitution.
- Rename the role=AGENT adapter test to state both halves; replace its
  try/catch with a `.catch()` one-liner.

tsc clean; 381 keyless tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011L8sGpz4ph1ryu66UgRF9T
@drewdrewthis

drewdrewthis commented Jun 26, 2026

Collaborator Author

Review Brief — fix(voice): real voice-in by default for hosted ElevenLabs ConvAI (#707)

Attention map, not a verdict. Gating verdict: review-verdict comment — READY.

Mode: Targeted Review · Scariest thing: the fixed 1.5 s silence tail decides every turn's end-of-turn — too short for an agent's VAD and the turn never closes, hanging to the 45 s ceiling then failing, with no text fallback left · Skimmable: No

30-second cockpit

  • Start here: elevenlabs.ts:76,93 + the wait-ceiling math in receiveAudio:366 — the hardcoded 1.5 s tail + 45 s ceiling that now govern turn closure for every hosted-EL voice run.
  • Why it matters: the text path is deleted, so turn closure depends entirely on EL's VAD firing on a fixed silence tail. If a customer agent's turn_timeout exceeds 1.5 s, the scripted turn never closes → hangs to 45 s → fails as an opaque timeout. No fallback, no clear "tail too short" signal.
  • Evidence: keyless fake-socket guards + regression baseline are proven and ci-checks (24.x) is green on HEAD; the headline live multi-turn behavior is manual + non-gating (#708), not a reproducible CI run.
  • Your next move: decide whether green keyless guards + manual live runs suffice to merge, or require one green live run first — then walk the 3 inspection targets below.

Key decisions & consequences

| Decision (chosen) | Alternative not taken | Consequence / blast radius (prod + arch) | Your check |
| --- | --- | --- | --- |
| Delete the text-commit path + turnCommitMode; sendAudio always streams real PCM + a silence tail. elevenlabs.ts sendAudio:302-322 (drops the {"type":"user_message","text":…} commit; deletes the -335 turn-commit test file) | #705 owner's option A: keep turnCommitMode:"text" as default + a loud runtime warning + capability flag (#705) | Prod: every hosted-EL voice user now exercises real ASR/VAD on turns 2+; there is no text fallback, so an agent that "passed" on the text path can now hang if its VAD never closes the turn. Arch: two paths collapse to one; removes a config knob (breaking for anyone who set turnCommitMode); real-audio is now the contract, and re-adding text would re-create the #705 bug surface. | Is turnCommitMode referenced in any published API / docs / customer config? If so this is a silent breaking change. |
| Fixed 1.5 s silence tail + 45 s keepalive hard-ceiling. SILENCE_TAIL_BYTES=72000 :76, wait bounded at max(timeout, 45s), pings don't re-arm (receiveAudio:366) | Configure EL's per-conversation turn_timeout (1–30 s) / turn_eagerness, or EL's first-party Simulate-Conversation API; both flagged settable in #705 but not pursued | Prod (the scariest chain): tail is tuned so a vanilla agent's default VAD closes the turn with no agent-side provisioning. An agent with turn_timeout > 1.5 s or eager turn-taking off → 1.5 s silence never trips end-of-turn → turn never closes → receiveAudio blocks to the 45 s ceiling → run fails as a timeout (not "tail too short"). No auto-recovery; only a constructor silenceTailBytes override exists. Arch: turn-closure timing is a hardcoded constant, not derived from the agent's actual EL turn config; any agent needing > 1.5 s forces editing the const or threading a per-scenario tail. | Were 72000 / 45 s validated against more than one hosted-EL agent config (varying turn_timeout/turn_eagerness), or tuned to the single test agent? What does a too-short tail surface as to the user? |
| Voiceify the generated/proceed() USER turn, gated, with a fail-closed realtime-user backstop. voiceifyGeneratedUserTurn:1276-1298, call-site callAgent:1031; gate = USER role + voice adapter present + voice-user-sim producer + not-already-audio | Leave proceed() broadcasting the generated turn as text (the status-quo that was the #705 bug); or voiceify unconditionally (mis-handles realtime users, double-encodes audio turns) | Prod: this is THE line that fixes the customer's bug. Before it, proceed()-driven user turns reached EL as text → audioCommits=1 → the next agent turn timed out; after, every proceed() user turn streams voice. Blast radius = the whole proceed() path on any voice run. Arch: adds a fail-closed seam; any future user-sim adapter shape must be a voice-user-sim (gets voiceified) or self-reject; silence is now a defect. | Can a producer that is neither a realtime-user (throws) nor an isVoiceUserSim (voiceified) fall through to the unchanged return messages = text → a silent #705 regression? How robust is the isVoiceUserSim duck-type? |
| Shared REALTIME_USER_AUTONOMOUS_UNSUPPORTED const + two-layer guard (adapter primary, executor backstop). Const: agent-shapes.ts:36; adapter call() rejects role=USER before any network: openai-realtime.ts:613-614; executor throw is the backstop: scenario-execution.ts:1294 | The prior executor-only guard, which fired too late: the real adapter's defaultVoiceCall receiveAudio timed out before the guard ran (the keyless fake call() returned instantly and masked it); or silently degrade the autonomous realtime user to text | Prod: autonomous/proceed() drive of a realtime USER now fails loud with actionable guidance (points at scripted user() / a voice user-sim) instead of a mystery receiveAudio timeout. Arch: two guard sites + one message const; reordering the executor backstop after the !isVoiceUserSim return silently reintroduces the voice→text substitution (flagged LOAD-BEARING ORDER in-code). Autonomous realtime-user remains out of scope (#711). | Light: well-covered by 4 keyless guard tests. Confirm both sites assert the shared const (no substring drift) and the adapter-level primary is what now fires first. |
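
The 72000-byte tail and the 45 s ceiling follow from simple arithmetic, sketched below. The pcm16/24000 format is stated in the review; mono audio is an assumption (it is consistent with the live log, where 124800 bytes lasts 2.60 s). The function names are hypothetical and not taken from elevenlabs.ts.

```typescript
const SAMPLE_RATE_HZ = 24_000; // pcm16/24000, as both adapters declare
const BYTES_PER_SAMPLE = 2; // 16-bit samples; single channel is an assumption
const BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE;

// Number of zeroed PCM bytes needed for `seconds` of trailing silence.
function silenceTailBytes(seconds: number): number {
  return Math.round(seconds * BYTES_PER_SECOND);
}

// Playback duration of a PCM payload of `bytes` bytes.
function durationSeconds(bytes: number): number {
  return bytes / BYTES_PER_SECOND;
}

// "wait bounded at max(timeout, 45s)": the caller's timeout can raise the
// bound above 45 s but never lower it.
function waitCeilingMs(timeoutMs: number, hardCeilingMs: number = 45_000): number {
  return Math.max(timeoutMs, hardCeilingMs);
}

const defaultTail = silenceTailBytes(1.5);
const seg1Seconds = durationSeconds(124_800);
const boundForShortTimeout = waitCeilingMs(15_000);
```

Under these assumptions a 1.5 s tail is exactly 72000 bytes, so an agent whose turn config needs, say, 3 s of silence would need a `silenceTailBytes` override of 144000.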

Inspection path (ranked)

  1. elevenlabs.ts:302-417 — real-audio-only sendAudio + the 1.5 s tail / 45 s ceiling in receiveAudio. Check: is the fixed tail robust across agent VAD configs, or tuned to the one test agent? This is the scariest surface.
  2. scenario-execution.ts:1276-1298, the voiceifyGeneratedUserTurn gate (#705 fix) + fail-closed backstop. Check: can a generated USER turn slip all four conditions and reach EL as text again (the original symptom)?
  3. openai-realtime.ts:605-614 (call() role=USER reject) + :738-771 (speakUserTurn) + :804 (out-of-band response.create, conversation:"none"). Check: verbatim speak emits no conversation.item.create; the lastAgentTranscript reset (a bug fixed mid-PR) holds.
  4. The -335/+256 test swap — coverage migration spot-check (see claim-vs-evidence row 5).

Axis map

  • Intent: Fixes #705: proceed()-driven hosted-EL multi-turn received no real audio (audioCommits=1, receiveAudio timed out). The diff matches: layer 1 (adapter real-audio) + layer 2 (executor voiceify) target it directly; layer 3 (speech-native realtime user) is adjacent scope the author explicitly bundles and bounds (autonomous drive → follow-up #711). Scope creep is declared, not hidden.
  • Correctness: the core risk is end-of-turn timing — closure rests on EL's VAD firing on a fixed 1.5 s tail. The voiceify gate is fail-closed; the realtime path uses an out-of-band response.create (verbatim) proven not to emit conversation.item.create. A real correctness trap already surfaced and was fixed mid-PR: speakUserTurn read lastAgentTranscript without resetting it, so turn 2+ replayed the previous turn's line — exactly the isolation the method exists to provide.
  • Architecture: blast radius = the now-default behavior for every hosted-EL voice user (text path deleted) + the entire proceed() path for voice runs. Future constraints imposed: real-audio is the contract; turnCommitMode is gone (breaking if published); turn-closure timing is a hardcoded constant, not derived from the agent's EL turn config; any new user-sim adapter must be a voice-user-sim or self-reject; the executor backstop's order is load-bearing for a hypothetical dual-shape adapter.
  • Risk (prod): chain ① a customer agent whose VAD needs > 1.5 s of silence → scripted turn never closes → receiveAudio blocks to the 45 s ceiling → run fails as an opaque timeout → no text fallback → human must hand-edit the tail. chain ② a generated USER turn slips all four gate conditions → broadcast as text → EL receives no audio → #705 regresses silently. Both are timing/classification failures, not crashes: they fail quiet or slow, the worst kind for a test harness.
  • Maintainability: strong inline docs on the constants + explicit LOAD-BEARING-ORDER comments; the shared const prevents guard drift. Two debt notes: the live javascript-voice-integration job is hollow-green (#708) so future live regressions won't redden CI — live debuggability rests on manual runs; and the -335/+256 test swap needs a coverage check (below). Python parity (scenario_executor.py:_call_agent) is a declared follow-up.

Claim vs evidence

| Claim | Evidence (proven) | Missing / weak | Your check |
| --- | --- | --- | --- |
| Real multi-turn voice-in works on hosted EL | keyless fake-socket suite (elevenlabs-real-audio.test.ts): turn 2 streams real user_audio_chunk, emits NO user_message, STT assertion + before→after regression baseline (reverting the real-audio path fails the guard); ci-checks green on HEAD | No reproducible live run in gating CI: javascript-voice-integration.yml is non-gating (a \| tee swallows vitest's exit; ambient EL flakiness, #708). Live proof is manual + ambient-flaky | accept manual + local live proof, or require one green live run before merge; a policy call, not a code defect |
| proceed() now streams voice (the customer's actual bug) | voiceifyGeneratedUserTurn + the before→after regression baseline (keyless) | live proceed(4) proof is local only (live-evidence comment); the gate's fall-through-to-text branch has no test proving a non-voice-sim / non-realtime producer can't slip through | spot-check the gate at :1281-1298 |
| Realtime user speaks the scripted line verbatim (not generate-a-reply) | openai-realtime-speak-user-turn.test.ts isolation contract: asserts the out-of-band shape and that no conversation.item.create is emitted; the lastAgentTranscript replay bug was caught + fixed in review | nothing material; keyless-proven | fast skim |
| Autonomous realtime-user fails loud, never silent voice→text | two-layer guard, 4 keyless tests (adapter + executor); shared const asserted | nothing; out-of-scope autonomous path tracked #711 | fast skim; confirm the adapter-level primary guard now fires before the timeout (the prior executor-only guard did not in prod) |
| Deleted elevenlabs-turn-commit.test.ts (-335) coverage is preserved | real-audio behaviors + regression baseline + silenceTailBytes validation migrated to elevenlabs-real-audio.test.ts | the turnCommitMode-mode tests are obsolete (feature removed, which is fine), but the post-interrupt re-engage and user_message / user_transcript observability cases are not obviously re-homed | confirm post-interrupt re-engagement is still covered (in elevenlabs.test.ts / the hosted-shape guard) or intentionally dropped with the feature |

Questions for the author

  1. Were 72000 (1.5 s tail) and the 45 s ceiling validated against more than one hosted-EL agent config (a longer turn_timeout, eager turn-taking off), or tuned to the test agent? #705 itself flagged turn_timeout/turn_eagerness as settable — was per-agent turn config weighed against a single fixed tail?
  2. In voiceifyGeneratedUserTurn:1276-1298, is there any producer shape that is neither a realtime-user (throw) nor an isVoiceUserSim (voiceify), so a generated USER turn falls through to return messages unchanged = text → a silent #705 regression?
  3. The deleted elevenlabs-turn-commit.test.ts (-335): post-interrupt re-engagement and user_message / user_transcript observability aren't obviously re-homed. Are they preserved elsewhere, or intentionally dropped with the feature?
  4. turnCommitMode is removed — is it referenced in any published API surface / docs / customer config? Removing it is a breaking change if so.
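
Question 2 can be explored mechanically by modelling the gate as a pure function and enumerating producer shapes. The sketch below is a model of the gate as this brief describes it, not the code in scenario-execution.ts: `classifyGeneratedTurn`, `GateInput`, and the exact order of the early returns (other than realtime-user before the voice-user-sim check, which the brief calls load-bearing) are assumptions.

```typescript
type GateDecision = "throw" | "voiceify" | "passthrough";

interface GateInput {
  role: "USER" | "AGENT";
  voiceAdapterPresent: boolean;
  alreadyAudio: boolean;
  producerIsRealtimeUser: boolean;
  producerIsVoiceUserSim: boolean;
}

function classifyGeneratedTurn(input: GateInput): GateDecision {
  // Turns the gate does not apply to are returned unchanged.
  if (input.role !== "USER" || !input.voiceAdapterPresent || input.alreadyAudio) {
    return "passthrough";
  }
  // Fail closed first: this check must precede the voice-user-sim return.
  if (input.producerIsRealtimeUser) return "throw";
  // The branch the question is about: a text turn leaves unchanged.
  if (!input.producerIsVoiceUserSim) return "passthrough";
  return "voiceify";
}

// Enumerate producer shapes for a text USER turn on a voice run and collect
// the ones that would reach the agent as text.
function findSilentTextLeaks(): GateInput[] {
  const leaks: GateInput[] = [];
  for (const producerIsRealtimeUser of [false, true]) {
    for (const producerIsVoiceUserSim of [false, true]) {
      const input: GateInput = {
        role: "USER",
        voiceAdapterPresent: true,
        alreadyAudio: false,
        producerIsRealtimeUser,
        producerIsVoiceUserSim,
      };
      if (classifyGeneratedTurn(input) === "passthrough") leaks.push(input);
    }
  }
  return leaks;
}

const leaks = findSilentTextLeaks();
```

In this model exactly one shape leaks: a producer that is neither a realtime user nor a voice user-sim. Whether such a producer can exist in practice depends on how the real duck-type checks are written.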

Don't spend time on

  • The realtime loud-guard mechanics — 4 keyless tests, fail-closed, shared const; the autonomous path is deliberately out of scope (#711).
  • The speakUserTurn isolation contract — keyless-proven; the one real trap (transcript replay) is already fixed.
  • The unrelated CI de-flake commits (weather-agent concrete city, pattern-2 reword) — pure-text paths the voice diff cannot reach.
  • Test-file naming / JSDoc wording churn.

Blockers

  • No hard code blockers. ci-checks is green on HEAD and the gating verdict is READY. Two judgment calls remain, both reviewer policy rather than code defects: (1) the headline live voice-in proof is manual + non-gating (#708) — accept it, or require one green live run before merge; (2) removing turnCommitMode is a breaking change if it is a published option — confirm before merge.

References

PR #707 · issue #705 · gating verdict comment (READY) · live-evidence comment · related: #708 (CI honesty / non-gating live job) · #711 (autonomous realtime-user follow-up) · #638 (old one-exchange ceiling) · #596 (v0.4.14 text-default origin) · #567 (deleted turn-commit tests). Files: elevenlabs.ts · scenario-execution.ts · openai-realtime.ts · agent-shapes.ts. Anchored to head SHA 75588c1.

@sergioestebance

Contributor

Ruthless review — verdict: technically merge-ready, one P2 doc fix to fold in

Reviewed at HEAD 75588c1. Validated the gate locally, not just on trust:

  • tsc --noEmit — clean
  • vitest run src: 855 passed, 1 skipped (matches the PR's claim)
  • ci-checks — green
  • Realtime→EL audio bridge: both adapters declare pcm16/24000, and the EL adapter warns at runtime on user_input_audio_format drift — so the realtime-user (24 kHz) → hosted-EL (24 kHz) path has no sample-rate mismatch. The bridge is internally consistent.

The three-layer fix is well-targeted: real-audio-only EL path (1.5 s tail + 45 s keepalive hard-ceiling), out-of-band verbatim speakUserTurn for the realtime user, and fail-loud guards at two layers (adapter call() primary + executor backstop) for the unsupported autonomous-realtime-user mode. Guard ordering (isRealtimeUserAgent before the !isVoiceUserSim return) is correctly fail-closed and covered by keyless tests.

P2 — two stale load-bearing comments contradict the shipped behavior

Same class as commit 7d0d02d ("two stale docs") — two more slipped through. Both describe the deleted text-commit path as if it were current:

  • javascript/src/voice/adapters/openai-realtime.ts:720-722 — "The transcript is load-bearing for the hosted EL transport: its default turnCommitMode:"text" commits the user turn via a user_message event built from the AudioChunk transcript (it does not ingest raw user audio)."
  • javascript/src/execution/scenario-execution.ts:1145-1146 — "The hosted EL transport additionally commits the turn from the chunk's transcript (turnCommitMode:"text")."

This PR removed turnCommitMode/user_message entirely. sendAudio now streams chunk.data as user_audio_chunk (EL does ingest raw audio — that's the whole fix), and lastUserTranscript is populated from EL's own inbound user_transcript STT echo, not the outgoing chunk. So on the realtime→EL bridge the chunk transcript is observability-only, not load-bearing for turn-commit. The comments invert the PR's central change and will mislead the next maintainer into "fixing" transcript handling they think gates the turn.

Fix: reword both to: the bridged AudioChunk's PCM is streamed as user_audio_chunk and EL runs its own STT; the transcript is recorded for observability/fallback only.

Nit (non-blocking)

isRealtimeUserAgent now requires sendText even though the executor routes only through speakUserTurn; sendText survives solely for the isolated adapter demo. Harmless over-specification — leave or trim, your call.

Residual risks (acknowledged in the PR, not blockers)

Nothing here blocks correctness. Recommend folding the P2 comment fix into this PR before merge, since it's already in human-review.

@sergioestebance sergioestebance left a comment

Contributor


blessed

@drewdrewthis drewdrewthis added the slack-requested Slack PR review request posted label Jun 26, 2026
@drewdrewthis

Collaborator Author

⚠️ HOLD — do not merge. The @e2e voice suite is actually failing on this branch.

The real-voice-in multi-turn tests (elevenlabs-real-audio-multiturn, repro-705-proceed-multiturn) fail with receiveAudio timed out (the original #705 symptom — hosted agent produces no audio on turns 2+):

  • HEAD 75588c1a: 3/3 patterns failed (+ an ElevenLabs 401).
  • earlier a02132b9: 2/3 failed.

CI looks green only because the @e2e voice step runs continue-on-error (non-blocking) — the job reports success while every voice test fails. So #705 is not actually fixed; this needs a real diagnosis (and the CI should stop greening on failing voice tests) before any merge.

(Flagged by support after the failures surfaced — supersedes the green checks and the #dev review request.)

…carry agent transcripts

Wire scenario.proceed(N) to drive an autonomous OpenAI Realtime user
(role=USER) through a multi-turn voiced conversation against a hosted
ElevenLabs ConvAI agent — the faithful #705 fix. Reverses the prior
fail-loud guards: the executor now feeds EL's audio into the realtime
session and speaks a generative next user turn each proceed() step
(fork B, one in-context response.create), returned as the user's audio
turn. The type-based realtime-user guard is replaced by the
adapter-agnostic USER_TURN_NO_AUDIO_FOR_VOICE_AUT audio-presence invariant
(trips on the produced no-audio artifact regardless of producer type).

Fixes two defects found by inspecting the live LangWatch MESSAGE_SNAPSHOTs
(which diverge from the on-disk recording manifest):

- Missing agent-under-test transcript. Hosted EL streams raw PCM with the
  turn text on `agent_response` (lastAgentTranscript), but the adapter built
  the agent chunk from PCM only, so every agent turn reached the conversation
  message (and LangWatch) as audio with NO transcript — only the recording
  manifest got a lossy STT back-fill. defaultVoiceCall now (a) TURN-SCOPES the
  adapter's `lastAgentTranscript` — nulls it before a REPLY's drain (a
  no-incoming greeting, whose transcript is set on connect, is exempt) so a
  reply that emits audio but no fresh transcript cannot inherit the prior
  turn's text — and (b) attaches the native turn transcript to the merged
  chunk when it carries none, onto BOTH the message and the recording segment.
  A no-audio turn is never labeled with a transcript. (Turn-scoping in the
  shared path fixes every agent adapter uniformly, incl. OpenAI Realtime as
  the agent under test, which reaches it via super.call() — a cross-turn
  bleed an adversarial review caught.)

- Doubled user-sim turns. proceed() is USER-led, so a trailing scripted
  user() opener immediately before proceed() produced two adjacent user
  turns (the agent ingests only the latest pending audio, dropping the
  opener). The repro script now drains the agent's reply to the opener
  before proceed(), so every user turn is heard and answered. The e2e
  asserts result.messages is clean: no consecutive same-role turns and a
  transcript on every agent turn.
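
The "clean result.messages" check described above can be expressed as a small pure function. This is a sketch of the two stated conditions only; `TurnSketch` and `findCleanlinessIssues` are hypothetical names, and the real message objects carry more fields than a role and a transcript.

```typescript
interface TurnSketch {
  role: "user" | "agent";
  transcript?: string;
}

// Returns one human-readable issue per violation; an empty list means clean.
function findCleanlinessIssues(messages: TurnSketch[]): string[] {
  const issues: string[] = [];
  messages.forEach((message, index) => {
    const previous = index > 0 ? messages[index - 1] : undefined;
    // Condition 1: no two adjacent turns from the same role.
    if (previous !== undefined && previous.role === message.role) {
      issues.push(`turns ${index - 1} and ${index} are both ${message.role}`);
    }
    // Condition 2: every agent turn carries a non-empty transcript.
    if (message.role === "agent" && !message.transcript) {
      issues.push(`agent turn ${index} has no transcript`);
    }
  });
  return issues;
}

const doubledOpener: TurnSketch[] = [
  { role: "agent", transcript: "Hello, how can I help you today?" },
  { role: "user", transcript: "Hi, I have a question about my account." },
  { role: "user", transcript: "Can you tell me your support hours?" },
  { role: "agent" },
];
const doubledOpenerIssues = findCleanlinessIssues(doubledOpener);
```

The sample conversation trips both conditions: the two adjacent user turns (the doubled-opener defect) and the final agent turn with no transcript.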

Tests: voice-agent-transcript (transcript wiring, adapter-agnostic, offline),
proceed-turn-count (proceed(N) drives exactly N user turns),
judge-coherence-criterion (AGENTS_HEARD_EACH_OTHER discrimination, EL-free),
no-transcript-bleed regression (a reply with a stale transcript + no fresh
one carries none), repro-705 multi-turn e2e (env-gated, live-proven). Unit
suite 881 green, tsc + eslint clean both packages.

BREAKING CHANGE: the EL adapter's SDK migration removes the public
`WebSocketLike` test-seam type (superseded by the SDK's own WebSocketFactory).
Niche, but note it for the version bump.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XgQFps8Vu3nrCaL37bvgYf
@drewdrewthis drewdrewthis force-pushed the fix/705-real-voice-multiturn branch from 3811643 to d9d5a77 Compare July 1, 2026 00:39
@drewdrewthis

Collaborator Author

Adversarial self-review — verified findings + suggested follow-ups

A deeper adversarial pass over this branch (three independent reviewers — correctness / hygiene / test-fidelity — plus code-level verification of every major claim). The core design holds up, but the pass surfaced two verified MAJOR defects and two test-suite hollow-green hacks that the earlier review missed. Filing here so they're tracked against the diff.

Merge-base 0e12e59 → HEAD d9d5a77.

🔴 Verified defects (checked against the code, not just flagged)

F1 — every realtime user turn stalls ~15s (drain has no response.done terminator for audio turns)
OpenAIRealtimeAgentAdapter.receiveAudio returns a terminating empty chunk on response.done only for tool turns (if (this._completedToolCalls.length > 0), openai-realtime.ts:465). A normal spoken audio turn falls through, so _drainSpokenTurn (openai-realtime.ts:790) ends the turn only when receiveAudio throws its per-frame idle timeout, tailTimeoutS, which defaults to 15 (speakUserTurn(text, tailTimeoutS = 15) :753; speakGeneratedUserTurn likewise; the executor calls both with the default). Net: after the model finishes speaking, the drain waits a full 15s of dead air before breaking. proceed(4) ≈ 60s of silence.
Why it wasn't caught: unit tests set tailTimeoutS=1 or inject a synthetic zero-length delta terminator (the GA API doesn't emit one); the live multiturn tests carry timeout: 300_000 + retry:2, which mask the slowness as "green, just slow."
Suggested fix: terminate the drain on response.done / response.output_audio.done when audio was received this turn (mirror the tool-turn early-return at receiveAudio:465); keep the idle timeout as a backstop. Add an offline mock that emits output_audio.done + response.done with no empty delta and asserts prompt termination.
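
A minimal sketch of the suggested termination rule, under simplifying assumptions: the socket is modelled as a synchronous `nextEvent` callback that returns `null` when the idle timeout would fire, and `drainSpokenTurn` here is a stand-in, not the adapter's `_drainSpokenTurn`. The delta event name is assumed to follow the same `response.output_audio.*` naming as the `done` event cited above.

```typescript
type RealtimeEvent =
  | { type: "response.output_audio.delta"; bytes: number }
  | { type: "response.output_audio.done" }
  | { type: "response.done" };

interface DrainResult {
  audioBytes: number;
  endedBy: "response.done" | "idle-timeout";
}

function drainSpokenTurn(nextEvent: () => RealtimeEvent | null): DrainResult {
  let audioBytes = 0;
  for (;;) {
    const event = nextEvent();
    // Backstop: nothing arrived within the idle window.
    if (event === null) return { audioBytes, endedBy: "idle-timeout" };
    if (event.type === "response.output_audio.delta") {
      audioBytes += event.bytes;
      continue;
    }
    // Prompt termination: the response finished and this turn carried audio.
    if (event.type === "response.done" && audioBytes > 0) {
      return { audioBytes, endedBy: "response.done" };
    }
  }
}

// Offline mock: no empty-delta terminator, just the two done events.
function makeQueue(events: RealtimeEvent[]): { next: () => RealtimeEvent | null; polls: () => number } {
  let index = 0;
  let polls = 0;
  return {
    next: () => {
      polls += 1;
      const event = events[index];
      index += 1;
      return event ?? null;
    },
    polls: () => polls,
  };
}

const queue = makeQueue([
  { type: "response.output_audio.delta", bytes: 4800 },
  { type: "response.output_audio.done" },
  { type: "response.done" },
]);
const drained = drainSpokenTurn(queue.next);
```

The mock is polled exactly three times, so the drain never reaches the point where the idle timeout would have to fire.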

F2 — the fail-closed invariant accepts a transcript-only turn that EL cannot commit
voiceifyGeneratedUserTurn treats audio.transcript?.length > 0 as sufficient content (scenario-execution.ts, carriesContent). But EL now commits audio-only — text-commit was removed as a regression (elevenlabs.ts:684-686; enqueueSpeech early-returns on data.length === 0; grep confirms turnCommitMode survives in comments only, no live code). So a transcript-only / empty-audio turn (the #708 audio-drop flake: response.output_audio_transcript.done arrives but audio deltas are dropped) passes the invariant → EL sendAudio(empty) → no commit → the next agent() receiveAudio times out. That is the exact #705 symptom the invariant exists to prevent, re-surfaced as an opaque EL fault.
Suggested fix: for an audio-commit AUT, require audio bytes (guard/drop the transcript-only branch). Add a guard test: transcript-only + empty-audio must throw USER_TURN_NO_AUDIO_FOR_VOICE_AUT.
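
The stricter invariant could be sketched as below. Only the constant name comes from this PR; its string value, the `AudioChunkSketch` shape, and `assertUserTurnCarriesAudio` are assumptions made for illustration.

```typescript
// Assumed value; the real constant may hold a longer, actionable message.
const USER_TURN_NO_AUDIO_FOR_VOICE_AUT = "USER_TURN_NO_AUDIO_FOR_VOICE_AUT";

interface AudioChunkSketch {
  data: Uint8Array;
  transcript?: string;
}

// For an audio-commit agent under test, a transcript alone is not content:
// the turn must carry PCM bytes or the run fails loud.
function assertUserTurnCarriesAudio(chunk: AudioChunkSketch | undefined): void {
  if (chunk === undefined || chunk.data.length === 0) {
    throw new Error(USER_TURN_NO_AUDIO_FOR_VOICE_AUT);
  }
}

function tripsInvariant(chunk: AudioChunkSketch | undefined): boolean {
  try {
    assertUserTurnCarriesAudio(chunk);
    return false;
  } catch (error) {
    return error instanceof Error && error.message === USER_TURN_NO_AUDIO_FOR_VOICE_AUT;
  }
}

// The audio-drop flake: a transcript arrived but the audio deltas did not.
const transcriptOnly: AudioChunkSketch = {
  data: new Uint8Array(0),
  transcript: "Can you tell me your support hours?",
};
const voiced: AudioChunkSketch = { data: new Uint8Array(4800) };
```

With this rule the transcript-only turn fails at the executor with the named invariant, rather than later as a receiveAudio timeout on the next agent turn.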

Stale comments (ship a fix with F2): scenario-execution.ts:1146 and openai-realtime.ts:726 both claim EL commits the turn via turnCommitMode:"text". That mode no longer exists — the comments will mislead the next maintainer into believing a transcript alone suffices. Delete/correct them.

🟠 Verified test-suite hacks (hollow green)

Return-to-skip reports PASSED, not skipped. repro-705-proceed-multiturn.test.ts:139 and :237 (also realtime-user-hosted-el.test.ts) do if (!hasHostedKey) { console.log("SKIP…"); return; }. In vitest an early return records the test as PASSED. These are the primary proceed(N) autonomous-realtime-user regression guards, and there is no ELEVENLABS_API_KEY repo secret — so they fake-pass in every CI run. Fix: it.skipIf(!hasHostedKey)(…) (as judge-coherence-criterion.test.ts:29 already does) so they report as honestly skipped.
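
Why an early return is indistinguishable from a pass can be shown with a toy runner. This is a deliberately tiny model, not vitest's implementation: `runCase`, `CaseSketch`, and the two factory functions are hypothetical, and in the real suite the declared-skip form would be `it.skipIf(!hasHostedKey)(…)` as noted above.

```typescript
type Outcome = "passed" | "failed" | "skipped";

interface CaseSketch {
  name: string;
  skip: boolean;
  body: () => void;
}

// A runner can only see two things: a declared skip, or whether the body threw.
function runCase(testCase: CaseSketch): Outcome {
  if (testCase.skip) return "skipped";
  try {
    testCase.body();
    return "passed";
  } catch {
    return "failed";
  }
}

// Return-to-skip: the body bails out, so the runner sees a clean return.
function makeReturnToSkipCase(hasHostedKey: boolean): CaseSketch {
  return {
    name: "proceed(4) drives voiced turns (early return)",
    skip: false,
    body: () => {
      if (!hasHostedKey) return;
      // Live assertions would run here when a key is present.
    },
  };
}

// Declared skip: the condition is visible to the runner before the body runs.
function makeDeclaredSkipCase(hasHostedKey: boolean): CaseSketch {
  return {
    name: "proceed(4) drives voiced turns (declared skip)",
    skip: !hasHostedKey,
    body: () => {
      // Live assertions would run here when a key is present.
    },
  };
}

const earlyReturnOutcome = runCase(makeReturnToSkipCase(false));
const declaredSkipOutcome = runCase(makeDeclaredSkipCase(false));
```

With no key, the first case is recorded as passed and the second as skipped, which is the difference between a hollow green and an honest report.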

Self-fulfilling STT assertion. elevenlabs-real-audio.test.ts:228 asserts expect(adapter.lastUserTranscript, "expected an STT user_transcript").toBeTruthy() — but the test itself calls emitUserTranscript(...) to set that field, so it proves only that the event handler runs, not that audio→STT round-tripped. The load-bearing assertion in that test is audioCommitCount >= 2. Fix: split the transcript check into a separate, honestly-named test ("adapter processes user_transcript events") or drop it.

🟡 Narrower edges (flagged by review; not independently reproduced this pass)

  • F4 — realtime-as-AGENT: the per-turn transcript reset in defaultVoiceCall is duck-typed and can't clear the private _agentTranscriptBuf; a turn whose .done arrives empty after a prior turn hit maxDuration could caption turn N+1 with turn N's words. Reset _agentTranscriptBuf on the AGENT call() path too. Non-canonical config; untested.
  • F5 — the greeting-exemption reset is gated on incoming; a 2nd+ consecutive no-incoming agent() turn that emits audio without a fresh transcript could inherit the prior turn's text. The canonical #705 config (every agent turn replies to user audio) is safe. Untested.
  • F6 — the scripted user() realtime path (speakUserTurn → scriptCallAgent) bypasses the audio-presence invariant; an empty speakUserTurn result would hang silently. Only the proceed() path is guarded.
  • F7 — the verbatim opener uses conversation:"none", so it isn't in session history; the first proceed() autonomous turn can't "see" the line it just spoke (known persona-mirror quirk; caps proceed-turn-1 coherence).
  • F8 — proceed-turn-count.test.ts guards the count on a text run only; the voice progression that actually broke isn't covered offline.

⚪ Hygiene (flagged by review)

  • elevenlabs.ts uses bare console.warn + eslint-disable at 3 sites instead of the Logger.create(...) pattern its peer openai-realtime.ts uses.
  • elevenlabs.ts:818 onMessage is implicitly public (a leaked test seam); the peer pipecat.ts marks the same method private.
  • Magic numbers: openai-realtime.ts:798 bare 400 drain ceiling; tailTimeoutS = 15 default duplicated at :753/:949 — name them so they can't drift.

🟢 What held up (validated by ≥2 reviewers + code read)

  • Core design is sound: out-of-band verbatim voicing is idiomatic; the generic transcript reset in defaultVoiceCall correctly closes the realtime-as-AGENT cross-turn bleed; fail-loud-over-silent-degrade is the right stance.
  • Agent transcript reaches the LangWatch MESSAGE_SNAPSHOT end-to-end (separate text part survives the AG-UI converter into the snapshot + judge) — the original defect is genuinely closed at the snapshot layer, not just the on-disk recording.
  • proceed(N) drives exactly N turns — the earlier "proceed(4) drove 3" was a misread of the onTurn callback (fires N-1× by design); proceed-turn-count.test.ts now asserts the authoritative count with exact toBe(n).
  • The cross-turn-bleed tests, the EmptyAudioUser #708 guard (zero-byte audio fails the real-content invariant), and the anti-vacuous assertCoherentConversation guard are genuinely load-bearing, not theater.
  • Sample-rate cleared: EL / realtime I/O / AudioChunk are all pcm16/24000.

Suggested follow-up

  1. Fix F1 + F2 (+ delete the stale comments) — both are user-visible on the realtime→EL path.
  2. Fix the two hollow-green test hacks and add one merge-blocking integration test that runs proceed() against the real OpenAIRealtimeAgentAdapter(role=USER, loopback WS) + a fake voice AUT — currently only the env-gated e2e covers the combined path, so a routing regression would pass CI.
  3. F4–F8 + hygiene as lower-priority cleanup.

Balance: the feature is functionally correct — it drives real multi-turn voice, transcripts reach LangWatch, and the turn count is right. F1 makes it slow (~15s/turn) and F2 makes it fragile under the known #708 flake, and CI doesn't yet guard the headline path — so this is short of "merge-ready" until at least F1/F2 + the test hacks land.

Ubuntu and others added 2 commits July 1, 2026 11:07
…ness assertions

Exercises the voice API surface end-to-end against a hosted ElevenLabs agent
driven by an autonomous realtime user: verbatim + autonomous user turns,
time-based barge-in, silence handling, and voiceProceed with interruptions,
closed by the coherence judge. Asserts the on-disk recording (full.wav +
manifest + per-segment WAVs, every segment transcribed). Inline comments flag
the known rough edges (interrupt placement, voiceProceed turn semantics,
short-turn segment fidelity).
…proceed()

Remove the TypeScript-only voiceProceed + InterruptionConfig from the voice
kitchen-sink demo (voiceProceed is being removed from the SDK as a Python-parity
cleanup — tracked separately). The autonomous stretch now uses base
scenario.proceed(7, logTurn).

Wire-verified, no behavior regression: both the old voiceProceed(3) and the new
proceed(7) drive ZERO autonomous realtime-USER turns (a pre-existing #705 gap,
not introduced here). Inline comments corrected to the wire truth rather than a
guessed N-sizing story. Run coherence remains subject to hosted-EL mishearing
flakiness (#708).

Verified: examples/vitest `tsc --noEmit` green; no voiceProceed/InterruptionConfig
tokens remain in the file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

This PR's diff could not be evaluated automatically: Diff too large for automated evaluation (271199 chars exceeds 100000-char limit). Manual review required.

This PR requires a manual review before merging.

@sergioestebance

Contributor

Blessed. 🙏

"Let all things be done decently and in order." — 1 Corinthians 14:40

Code's sound — tsc clean, keyless src/ suite green, the fix itself is well-built. Ship it after a doc sweep: three stale comments describe the mechanism this PR deleted (turnCommitMode:"text"/user_message commit) as current, and the kitchensink test's rough-edge [2] calls proceed()-drives-the-realtime-user a no-op — the exact capability the PR delivers. Clean:

  • openai-realtime.ts:725 — transcript is no longer load-bearing for EL commit (EL now ingests raw PCM, not text)
  • scenario-execution.ts:1145 — same stale turnCommitMode:"text" claim
  • proceed-multiturn-kitchensink.test.ts rough-edge [2] — proceed() DOES drive the autonomous realtime user now; rewrite or drop

Mind the scripted-path asymmetry too (empty realtime audio → silent EL timeout, no fail-loud like the proceed path has).


Labels

slack-requested Slack PR review request posted

Development

Successfully merging this pull request may close these issues.

fix(voice): hosted EL ConvAI multi-turn silently drops user audio (text-commit) — bypasses the agent's ASR; re-evaluate keeping text as default

3 participants