fix(web): embed agent text in voice ready event for readback #682
Voice onReady now extracts the last speakable assistant message and embeds it in the ready inject so ConvAI can summarize without the user re-prompting. Also formats Codex/Cursor stream-json messages for live context updates and session history. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace jellybot/subtitle dogfood strings in tests with generic examples. Co-authored-by: Cursor <cursoragent@cursor.com>
cfbb3b7 to
63c3732
Follow-up for #681 - thanks to HAPI Bot for pointing at the exact call sites. This is intentionally small: four files in If anything should move into #640 instead, happy to split or close - just let me know.
Findings
- [Major] Preserve tool-call context for mixed Claude content arrays: formatMessage now calls extractSpeakableFromContent(content) before the existing content-array formatter. Since that helper also returns joined text for arrays, a Claude assistant payload with both text and tool_use blocks returns before the loop that emits Claude Code is using ..., so voice context drops tool-call updates that were previously sent. Evidence: web/src/realtime/hooks/contextFormatters.ts:103.
Suggested fix:
const speakable = !isContentArray(content)
    ? extractSpeakableFromContent(content)
    : null
if (speakable) {
    const roleForFormat = normalizedRole === 'user' ? 'user' : 'assistant'
    return formatPlainText(roleForFormat, speakable)
}
Summary
- Review mode: initial
- One regression found in voice context formatting for mixed assistant text/tool-use arrays.
Testing
- Not run (automation): bun is not installed in this runner (bun: command not found).
    return null
}

const speakable = extractSpeakableFromContent(content)
[MAJOR] Preserve tool-call context for mixed Claude content arrays
formatMessage now calls extractSpeakableFromContent(content) before the existing content-array loop. Because that helper also returns joined text for arrays, a Claude assistant payload like [{ type: 'text', ... }, { type: 'tool_use', ... }] returns at line 106 and never formats the tool_use item. That regresses voice context by dropping tool-call updates that were previously emitted by the loop.
Suggested fix:
const speakable = !isContentArray(content)
    ? extractSpeakableFromContent(content)
    : null
if (speakable) {
    const roleForFormat = normalizedRole === 'user' ? 'user' : 'assistant'
    return formatPlainText(roleForFormat, speakable)
}
Add a regression test with a mixed text + tool_use array asserting both the text and tool-call lines are present.
Fixed in 818bf7c. Guarded the call with !isContentArray(content) so arrays fall through to the existing loop as before. Added a regression test with a mixed text+tool_use payload — was red without the guard, green after.
…essage
extractSpeakableFromContent also handles content arrays (joins text items), so calling it unconditionally before the existing array loop caused mixed text+tool_use payloads to return early without formatting the tool_use item.
Guard with !isContentArray so the loop handles arrays as before.
Adds regression test: mixed text+tool_use array must produce both the text and the tool-call line (was red before this fix).
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
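For readers following the thread, here is a minimal sketch of the guarded shortcut and the mixed-array case it protects. The types and helper bodies are simplified, hypothetical stand-ins (the real ones live in contextFormatters.ts and may differ); only the names formatMessage, extractSpeakableFromContent and isContentArray and the "Claude Code is using ..." wording come from the review above.

```typescript
// Hypothetical, simplified stand-ins for the helpers named in the review.
type TextItem = { type: 'text'; text: string }
type ToolUseItem = { type: 'tool_use'; name: string }
type ContentItem = TextItem | ToolUseItem
type Content = string | ContentItem[]

function isContentArray(content: Content): content is ContentItem[] {
    return Array.isArray(content)
}

// Returns joined text for strings AND arrays; the array case is what
// triggered the early return before the guard was added.
function extractSpeakableFromContent(content: Content): string | null {
    if (typeof content === 'string') {
        return content.trim() || null
    }
    const joined = content
        .filter((item): item is TextItem => item.type === 'text')
        .map((item) => item.text.trim())
        .filter((text) => text.length > 0)
        .join('\n')
    return joined || null
}

function formatMessage(content: Content): string | null {
    // Guard: only take the plain-text shortcut for non-array content.
    const speakable = !isContentArray(content) ? extractSpeakableFromContent(content) : null
    if (speakable) {
        return `Claude Code: ${speakable}`
    }
    if (!isContentArray(content)) {
        return null
    }
    // Arrays fall through to the per-item loop, so tool_use items are still formatted.
    const lines: string[] = []
    for (const item of content) {
        if (item.type === 'text') {
            if (item.text.trim()) {
                lines.push(`Claude Code: ${item.text.trim()}`)
            }
        } else {
            lines.push(`Claude Code is using ${item.name}`)
        }
    }
    return lines.length > 0 ? lines.join('\n') : null
}

// Mixed text + tool_use payload, the shape used by the regression test.
const mixed: Content = [
    { type: 'text', text: 'Running the tests now.' },
    { type: 'tool_use', name: 'Bash' },
]
const formatted = formatMessage(mixed)
```

In this sketch, removing the !isContentArray guard makes formatMessage return only the text line for the mixed payload, which is the regression the review describes.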
Findings
- [Major] Status events can be spoken as the last assistant answer: extractSpeakableFromContent now accepts any { data: { type: 'message', message } } envelope, not just Codex payloads. Existing sendSessionEvent({ type: 'message', message }) emits { role: 'agent', content: { type: 'event', data: ... } }, so abort/retry/failure status events can enter voice context as Claude Code: and can be selected by extractLastAssistantSpeakable for the ready readback instead of the actual final assistant response. Evidence: web/src/realtime/hooks/contextFormatters.ts:152, related producer cli/src/api/apiSession.ts:541.
Suggested fix:
if (isObject(content) && content.type === 'codex' && isObject(content.data)) {
    const data = content.data
    if (data.type === 'message' && typeof data.message === 'string' && data.message.trim()) {
        return data.message.trim()
    }
}
Questions
- None.
Summary
- Review mode: initial
- One issue found: Codex message extraction is too broad and can turn agent status events into spoken assistant answers.
Testing
- Not run: bun test web/src/realtime/hooks/contextFormatters.test.ts failed because bun is not available in this runner.
HAPI Bot
}

// Codex / stream-json agent messages: { type: 'codex', data: { type: 'message', message: '...' } }
if (isObject(content) && typeof content.type === 'string' && isObject(content.data)) {
[Major] extractSpeakableFromContent now accepts any object whose data.type is 'message', but existing session status events use { type: 'event', data: { type: 'message', message } } via sendSessionEvent. That means abort/retry/failure status messages can be formatted as Claude Code: and can become the ready readback text instead of the actual assistant answer.
Suggested fix:
if (isObject(content) && content.type === 'codex' && isObject(content.data)) {
const data = content.data
if (data.type === 'message' && typeof data.message === 'string' && data.message.trim()) {
return data.message.trim()
}
}
Fixed in 09e9b55. Narrowed to content.type === 'codex' — as the comment already said it should be. Added regression test: { type: 'event', data: { type: 'message', message: '...' } } must return null from formatMessage (was red before the narrowing, green after).
The helper matched any object with a string type and a data property,
so sendSessionEvent({ type: 'message', message }) events (which arrive as
{ type: 'event', data: { type: 'message', message } }) were falsely formatted
as speakable assistant text and could be selected as the ready readback.
Narrow the Codex path to content.type === 'codex' as the comment already states.
Adds regression test: session status event must return null from formatMessage.
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
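To illustrate the narrowing, a small sketch under stated assumptions: extractCodexMessage is a hypothetical name for just the Codex branch of the helper, and the payload values are made up. Only the two envelope shapes ({ type: 'codex', ... } and { type: 'event', ... }) are taken from the discussion above.

```typescript
// Hypothetical, simplified sketch of the Codex branch only; the real
// extractSpeakableFromContent handles more shapes than this.
function isObject(value: unknown): value is Record<string, unknown> {
    return typeof value === 'object' && value !== null && !Array.isArray(value)
}

function extractCodexMessage(content: unknown): string | null {
    // Only the codex envelope counts as speakable assistant text.
    if (!isObject(content) || content.type !== 'codex') {
        return null
    }
    const data = content.data
    if (!isObject(data) || data.type !== 'message') {
        return null
    }
    const message = data.message
    if (typeof message === 'string' && message.trim()) {
        return message.trim()
    }
    return null
}

// Codex / stream-json agent message: should be speakable.
const codexPayload = { type: 'codex', data: { type: 'message', message: ' Done. ' } }
// Session status event with the same inner shape: must NOT be speakable.
const statusEvent = { type: 'event', data: { type: 'message', message: 'Aborted' } }

const fromCodex = extractCodexMessage(codexPayload)  // 'Done.'
const fromStatus = extractCodexMessage(statusEvent)  // null
```

The inner data shape is identical in both payloads, so the outer type check is the only thing separating an assistant answer from a status event in this sketch.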
Findings
- None.
Questions
- None.
Summary
- Review mode: follow-up after new commits
- No high-confidence issues found in the latest diff. The follow-up change narrows Codex message extraction to type: 'codex' and adds coverage for session status events not being spoken as assistant text. Residual risk: tests could not be run in this runner because bun is unavailable.
Testing
- Not run (automation): bun test web/src/realtime/hooks/contextFormatters.test.ts failed with bun: command not found.
HAPI Bot
Woohoo! It's a spectacular tool, so glad I could help 🙏
…On Mon, 25 May 2026, 06:31 weishu, ***@***.***> wrote:
Merged #682 into main.
Summary
Sorry if this is the wrong seam - I might be missing context from #640.
When a coding agent finishes (ready event), the ElevenLabs voice assistant often did not read back the agent's answer. The user had to ask repeatedly; ConvAI sometimes hallucinated partial summaries.
Root cause (I think):
- formatReadyEvent used to tell ConvAI the summary was in "previous messages" without embedding assistant text inline.
- formatMessage did not format Codex/Cursor stream-json payloads, so onMessages and session history context were empty for those sessions.
This PR embeds the last speakable assistant message in the ready inject (<text>…</text>) and teaches formatMessage / extractLastAssistantSpeakable the same codex stream-json path.
Test plan
- bun test web/src/realtime/hooks/contextFormatters.test.ts (12 tests)
- conv_4501ksdt0athfhfr189tq3jehkcq (399s, Cursor session). At ~329s, ready inject included embedded <text>…</text> with the agent's last message; at ~330s voice summarized it accurately (e.g. DB 2.43 GiB → 842 MiB) without the user re-prompting.
Issues
Fixes #681
Notes
formatPlainText) - happy to follow up separately if preferred.
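As a rough illustration of the ready-inject idea from the Summary, the sketch below embeds the last speakable assistant message in a <text>…</text> wrapper. The SessionMessage shape, the function signatures and the prompt wording are all assumptions made for this example; only the function names and the <text> wrapper come from the description, and the actual formatReadyEvent may look quite different.

```typescript
// Hypothetical message shape; the real session messages carry richer content.
interface SessionMessage {
    role: 'user' | 'assistant'
    // null when the message has no speakable text (for example, tool calls only)
    text: string | null
}

// Walk backwards to find the most recent assistant message with speakable text.
function extractLastAssistantSpeakable(messages: SessionMessage[]): string | null {
    for (const message of [...messages].reverse()) {
        if (message.role === 'assistant' && message.text !== null && message.text.trim()) {
            return message.text.trim()
        }
    }
    return null
}

// Assumed wording: the point is only that the text travels inline with the event
// instead of being referred to as "previous messages".
function formatReadyEvent(messages: SessionMessage[]): string {
    const last = extractLastAssistantSpeakable(messages)
    if (last === null) {
        return 'The agent has finished and is ready for input.'
    }
    return `The agent has finished. Summarize its last message for the user:\n<text>${last}</text>`
}

const history: SessionMessage[] = [
    { role: 'user', text: 'Clean up the database' },
    { role: 'assistant', text: 'Cleanup finished.' },
    { role: 'assistant', text: null },
]
const inject = formatReadyEvent(history)
```

Here the trailing assistant entry with no text is skipped, so the inject carries "Cleanup finished." rather than an empty readback.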