fix(web): embed agent text in voice ready event for readback #682

Merged
tiann merged 4 commits into tiann:main from heavygee:fix/voice-readback on May 25, 2026

Conversation

@heavygee
Contributor

@heavygee heavygee commented May 24, 2026

Summary

Sorry if this is the wrong seam - I might be missing context from #640.

When a coding agent finishes (ready event), the ElevenLabs voice assistant often did not read back the agent's answer. The user had to ask repeatedly; ConvAI sometimes hallucinated partial summaries.

Root cause (I think):

  1. formatReadyEvent used to tell ConvAI the summary was in "previous messages" without embedding assistant text inline.
  2. formatMessage did not format Codex/Cursor stream-json payloads, so onMessages and session history context were empty for those sessions.

This PR embeds the last speakable assistant message in the ready inject (<text>…</text>) and teaches formatMessage / extractLastAssistantSpeakable the same codex stream-json path.
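
A minimal sketch of the idea, for readers without the diff open. The function names follow the description above, but the signatures, the message shape, and the prompt wording are assumptions rather than the code in this PR:

```typescript
// Hypothetical sketch: signatures, message shape, and wording are assumed.
type SessionMessage = { role: 'user' | 'agent'; content: unknown }

function isObject(value: unknown): value is Record<string, unknown> {
    return typeof value === 'object' && value !== null
}

// Returns trimmed text for plain strings and Codex-style envelopes, else null.
function extractSpeakable(content: unknown): string | null {
    if (typeof content === 'string') {
        return content.trim() || null
    }
    if (!isObject(content)) return null
    const { type, data } = content
    if (type !== 'codex' || !isObject(data)) return null
    const { type: dataType, message } = data
    if (dataType !== 'message' || typeof message !== 'string') return null
    return message.trim() || null
}

// Walk from newest to oldest so the most recent speakable agent message wins.
function extractLastAssistantSpeakable(messages: SessionMessage[]): string | null {
    for (const message of [...messages].reverse()) {
        if (message.role !== 'agent') continue
        const text = extractSpeakable(message.content)
        if (text) return text
    }
    return null
}

// Embed the text inline instead of pointing the voice agent at "previous messages".
function formatReadyEvent(sessionId: string, lastAssistantText: string | null): string {
    const header = `Agent finished working in session ${sessionId}.`
    if (!lastAssistantText) return header
    return `${header} Summarize this for the user:\n<text>${lastAssistantText}</text>`
}
```

With this shape, the ready inject carries the answer itself, so the voice agent does not depend on earlier context updates having been delivered and formatted.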

Test plan

  • bun test web/src/realtime/hooks/contextFormatters.test.ts (12 tests)
  • Manual dogfood: ElevenLabs conv_4501ksdt0athfhfr189tq3jehkcq (399s, Cursor session). At ~329s, ready inject included embedded <text>…</text> with the agent's last message; at ~330s voice summarized it accurately (e.g. DB 2.43 GiB → 842 MiB) without the user re-prompting.
  • Regression: Claude session voice still works

Issues

Fixes #681

Notes

Voice onReady now extracts the last speakable assistant message and
embeds it in the ready inject so ConvAI can summarize without the user
re-prompting. Also formats Codex/Cursor stream-json messages for live
context updates and session history.

Co-authored-by: Cursor <cursoragent@cursor.com>

Replace jellybot/subtitle dogfood strings in tests with generic examples.

Co-authored-by: Cursor <cursoragent@cursor.com>
@heavygee heavygee force-pushed the fix/voice-readback branch from cfbb3b7 to 63c3732 Compare May 24, 2026 20:40
@heavygee
Contributor Author

Follow-up for #681 — thanks to HAPI Bot for pointing at the exact call sites.

This is intentionally small: four files in web/src/realtime/hooks/ only. Test plan includes a Cursor-session ElevenLabs dogfood conv where ready inject carried embedded <text>…</text> and voice read back without a re-prompt.

If anything should move into #640 instead, happy to split or close — just let me know.

@heavygee heavygee marked this pull request as ready for review May 24, 2026 20:43

@github-actions github-actions Bot left a comment


Findings

  • [Major] Preserve tool-call context for mixed Claude content arrays: formatMessage now calls extractSpeakableFromContent(content) before the existing content-array formatter. Because that helper also returns joined text for arrays, a Claude assistant payload with both text and tool_use blocks returns before reaching the loop that emits the "Claude Code is using ..." lines, so voice context drops tool-call updates that were previously sent. Evidence: web/src/realtime/hooks/contextFormatters.ts:103.
    Suggested fix:
    const speakable = !isContentArray(content)
        ? extractSpeakableFromContent(content)
        : null
    if (speakable) {
        const roleForFormat = normalizedRole === 'user' ? 'user' : 'assistant'
        return formatPlainText(roleForFormat, speakable)
    }

Summary

  • Review mode: initial
  • One regression found in voice context formatting for mixed assistant text/tool-use arrays.

Testing

  • Not run (automation): bun is not installed in this runner (bun: command not found).

return null
}

const speakable = extractSpeakableFromContent(content)

[MAJOR] Preserve tool-call context for mixed Claude content arrays

formatMessage now calls extractSpeakableFromContent(content) before the existing content-array loop. Because that helper also returns joined text for arrays, a Claude assistant payload like [{ type: 'text', ... }, { type: 'tool_use', ... }] returns at line 106 and never formats the tool_use item. That regresses voice context by dropping tool-call updates that were previously emitted by the loop.

Suggested fix:

const speakable = !isContentArray(content)
    ? extractSpeakableFromContent(content)
    : null
if (speakable) {
    const roleForFormat = normalizedRole === 'user' ? 'user' : 'assistant'
    return formatPlainText(roleForFormat, speakable)
}

Add a regression test with a mixed text + tool_use array asserting both the text and tool-call lines are present.

Contributor Author


Fixed in 818bf7c. Guarded the call with !isContentArray(content) so arrays fall through to the existing loop as before. Added a regression test with a mixed text+tool_use payload — was red without the guard, green after.
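
For reference, a self-contained sketch of the guard and the regression it protects against. The helper names mirror the review comment; their bodies, the simplified ContentItem type, and the exact output strings are assumptions, not the contents of contextFormatters.ts:

```typescript
// Hypothetical sketch: helper bodies and the ContentItem type are assumed.
type TextItem = { type: 'text'; text: string }
type ToolUseItem = { type: 'tool_use'; name: string }
type ContentItem = TextItem | ToolUseItem

function isContentArray(content: unknown): content is ContentItem[] {
    return Array.isArray(content)
}

// Joins text for strings AND arrays, which is why it must not run first on arrays.
function extractSpeakableFromContent(content: unknown): string | null {
    if (typeof content === 'string') return content.trim() || null
    if (isContentArray(content)) {
        const joined = content
            .filter((item): item is TextItem => item.type === 'text')
            .map((item) => item.text)
            .join(' ')
            .trim()
        return joined || null
    }
    return null
}

function formatMessage(content: unknown): string | null {
    // Guard: arrays skip the early return and fall through to the loop below.
    const speakable = !isContentArray(content) ? extractSpeakableFromContent(content) : null
    if (speakable) return `Claude Code: ${speakable}`

    if (!isContentArray(content)) return null
    const lines: string[] = []
    for (const item of content) {
        if (item.type === 'text') {
            lines.push(`Claude Code: ${item.text}`)
        } else {
            lines.push(`Claude Code is using ${item.name}`)
        }
    }
    return lines.length > 0 ? lines.join('\n') : null
}
```

Without the guard, the mixed array would return after the text was joined and the tool_use line would never be produced.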

…essage

extractSpeakableFromContent also handles content arrays (joins text items),
so calling it unconditionally before the existing array loop caused mixed
text+tool_use payloads to return early without formatting the tool_use item.
Guard with !isContentArray so the loop handles arrays as before.

Adds regression test: mixed text+tool_use array must produce both the text
and the tool-call line (was red before this fix).

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

@github-actions github-actions Bot left a comment


Findings

  • [Major] Status events can be spoken as the last assistant answer: extractSpeakableFromContent now accepts any { data: { type: 'message', message } } envelope, not just Codex payloads. The existing sendSessionEvent({ type: 'message', message }) emits { role: 'agent', content: { type: 'event', data: ... } }, so abort/retry/failure status events can enter voice context with the "Claude Code:" prefix and can be selected by extractLastAssistantSpeakable for the ready readback instead of the actual final assistant response. Evidence: web/src/realtime/hooks/contextFormatters.ts:152; related producer: cli/src/api/apiSession.ts:541.
    Suggested fix:
    if (isObject(content) && content.type === 'codex' && isObject(content.data)) {
        const data = content.data
        if (data.type === 'message' && typeof data.message === 'string' && data.message.trim()) {
            return data.message.trim()
        }
    }

Questions

  • None.

Summary

  • Review mode: initial
  • One issue found: Codex message extraction is too broad and can turn agent status events into spoken assistant answers.

Testing

  • Not run: bun test web/src/realtime/hooks/contextFormatters.test.ts failed because bun is not available in this runner.

HAPI Bot

}

// Codex / stream-json agent messages: { type: 'codex', data: { type: 'message', message: '...' } }
if (isObject(content) && typeof content.type === 'string' && isObject(content.data)) {

[Major] extractSpeakableFromContent now accepts any object whose data.type is 'message', but existing session status events use { type: 'event', data: { type: 'message', message } } via sendSessionEvent. That means abort/retry/failure status messages can be formatted with the "Claude Code:" prefix and can become the ready readback text instead of the actual assistant answer.

Suggested fix:

if (isObject(content) && content.type === 'codex' && isObject(content.data)) {
    const data = content.data
    if (data.type === 'message' && typeof data.message === 'string' && data.message.trim()) {
        return data.message.trim()
    }
}

Contributor Author


Fixed in 09e9b55. Narrowed to content.type === 'codex' — as the comment already said it should be. Added regression test: { type: 'event', data: { type: 'message', message: '...' } } must return null from formatMessage (was red before the narrowing, green after).
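
To make the difference concrete, here is a small sketch that contrasts the over-broad check with the narrowed one. extractBroad, extractNarrow, and readMessage are hypothetical names used only for this illustration; the envelope shapes are the ones quoted in the review:

```typescript
// Hypothetical sketch: function names are illustrative, not from the PR.
function isObject(value: unknown): value is Record<string, unknown> {
    return typeof value === 'object' && value !== null
}

// Shared inner check: { type: 'message', message: '<non-empty string>' }.
function readMessage(data: Record<string, unknown>): string | null {
    const { type, message } = data
    if (type !== 'message' || typeof message !== 'string') return null
    return message.trim() || null
}

// Behaviour flagged by the review: any string `type` with a `data` object qualifies.
function extractBroad(content: unknown): string | null {
    if (!isObject(content)) return null
    const { type, data } = content
    if (typeof type !== 'string' || !isObject(data)) return null
    return readMessage(data)
}

// Narrowed behaviour: only the Codex envelope qualifies.
function extractNarrow(content: unknown): string | null {
    if (!isObject(content)) return null
    const { type, data } = content
    if (type !== 'codex' || !isObject(data)) return null
    return readMessage(data)
}

// A session status event and a Codex agent message share the same inner shape.
const statusEvent = { type: 'event', data: { type: 'message', message: 'Aborted' } }
const codexMessage = { type: 'codex', data: { type: 'message', message: '  All tests pass.  ' } }
```

Here extractBroad(statusEvent) yields the status text, while extractNarrow(statusEvent) yields null and extractNarrow(codexMessage) still yields the trimmed agent message.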

The helper matched any object with a string type and a data property,
so sendSessionEvent({ type: 'message', message }) events (which arrive as
{ type: 'event', data: { type: 'message', message } }) were falsely formatted
as speakable assistant text and could be selected as the ready readback.

Narrow the Codex path to content.type === 'codex' as the comment already states.
Adds regression test: session status event must return null from formatMessage.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

@github-actions github-actions Bot left a comment


Findings

  • None.

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • No high-confidence issues found in the latest diff. The follow-up change narrows Codex message extraction to type: 'codex' and adds coverage for session status events not being spoken as assistant text. Residual risk: tests could not be run in this runner because bun is unavailable.

Testing

  • Not run (automation): bun test web/src/realtime/hooks/contextFormatters.test.ts failed with bun: command not found.

HAPI Bot

@tiann tiann merged commit 3258c52 into tiann:main May 25, 2026
3 checks passed
@heavygee
Contributor Author

heavygee commented May 25, 2026 via email

Development

Successfully merging this pull request may close these issues.

bug(voice): agent completion readback unreliable (all agents; worst on stream-json flavors)

2 participants