A demo of a real-time, voice-to-voice AI pipeline orchestrated using the Web Streams API on Node.js.
This project is architected entirely around the standard `ReadableStream`, `WritableStream`, and `TransformStream` interfaces. This approach allows for efficient, backpressure-aware data processing, where each step in the pipeline handles a specific transformation of the audio or text signal.
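For context, every stage in the pipeline follows the same shape. Below is a minimal, hypothetical sketch (not taken from this project) showing how a `ReadableStream` source, a `TransformStream` stage, and a `WritableStream` sink compose with `pipeThrough()`/`pipeTo()`, with backpressure propagating automatically from the slowest consumer back to the source:

```ts
// Minimal illustration of the pattern each pipeline stage follows
// (hypothetical example, not code from this repo).
// Run inside an ES module or async function (uses top-level await).
const source = new ReadableStream<string>({
  start(controller) {
    controller.enqueue("hello");
    controller.enqueue("world");
    controller.close();
  },
});

const upperCase = new TransformStream<string, string>({
  transform(chunk, controller) {
    // Each stage performs one small transformation per chunk.
    controller.enqueue(chunk.toUpperCase());
  },
});

const sink = new WritableStream<string>({
  write(chunk) {
    console.log(chunk); // "HELLO", then "WORLD"
  },
});

// pipeThrough()/pipeTo() wire the stages together and handle backpressure.
await source.pipeThrough(upperCase).pipeTo(sink);
```

The real pipeline swaps this toy transform for audio- and text-specific stages, as shown in the diagram below.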
flowchart TD
subgraph Client [Browser Client]
Mic[Microphone] -->|Opus/WebM Stream| WS_Out[WebSocket]
WS_In[WebSocket] -->|PCM Float32 Stream| Speaker[AudioContext]
end
subgraph Server [Node.js Hono Server]
WS_Receiver[WS Receiver] -->|Buffer Stream| Pipeline
subgraph Pipeline [Web Streams Pipeline]
direction TB
Opus[OpusToPcmTransform] -->|Raw PCM Int16| VAD[VADBufferTransform]
VAD -->|Buffered Speech| STT[OpenAISTTTransform]
STT -->|Text String| Agent[AgentTransform]
Agent -->|AIMessageChunk| Filter[MessageChunkTransform]
Filter -->|Text Stream| Chunker[SentenceChunkerTransform]
Chunker -->|Sentence String| TTS[ElevenLabsTTSTransform]
end
TTS -->|PCM Int16 Buffer| WS_Sender[WS Sender]
end
WS_Out <--> WS_Receiver
WS_Sender <--> WS_In
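The browser half of the diagram can be approximated as follows. This is a hedged, hypothetical sketch rather than the project's client code: the WebSocket URL, the 250 ms chunk interval, and the 16 kHz playback rate are assumptions, and it assumes the server's 16-bit PCM is converted to Float32 on the client before playback through an `AudioContext`:

```ts
// Hypothetical browser-side sketch (not code from this repo).
// Assumes a module script (top-level await) and user-gesture activation of audio.
const ws = new WebSocket("ws://localhost:3000/ws"); // endpoint URL is an assumption
ws.binaryType = "arraybuffer";

// Upstream: capture the microphone as WebM/Opus and ship chunks over the socket.
const media = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(media, { mimeType: "audio/webm;codecs=opus" });
recorder.ondataavailable = (e) => {
  if (e.data.size > 0 && ws.readyState === WebSocket.OPEN) ws.send(e.data);
};
recorder.start(250); // emit a chunk roughly every 250 ms

// Downstream: convert returned 16-bit PCM to Float32 and schedule buffers
// back-to-back so playback stays gapless. May need ctx.resume() after a click.
const ctx = new AudioContext({ sampleRate: 16000 }); // playback rate is an assumption
let playhead = 0;
ws.onmessage = (event) => {
  const int16 = new Int16Array(event.data as ArrayBuffer);
  if (int16.length === 0) return;
  const float32 = Float32Array.from(int16, (s) => s / 32768);
  const buffer = ctx.createBuffer(1, float32.length, ctx.sampleRate);
  buffer.getChannelData(0).set(float32);
  const node = ctx.createBufferSource();
  node.buffer = buffer;
  node.connect(ctx.destination);
  playhead = Math.max(playhead, ctx.currentTime);
  node.start(playhead);
  playhead += buffer.duration;
};
```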
The core logic resides in `packages/web/src/index.ts`, where the pipeline is composed:
inputStream
.pipeThrough(new OpusToPcmTransform()) // ffmpeg: WebM -> PCM
.pipeThrough(new VADBufferTransform()) // Silero VAD: Gates stream on speech detection
.pipeThrough(new OpenAISTTTransform()) // OpenAI Whisper: Audio Buffer -> Text
.pipeThrough(new AgentTransform(graph)) // LangGraph: Text -> Streaming AI Tokens
.pipeThrough(new AIMessageChunkTransform()) // Formatting: Chunk -> String
.pipeThrough(new SentenceChunkTransform()) // Optimization: Buffers tokens into sentences
  .pipeThrough(new ElevenLabsTTSTransform()) // ElevenLabs: Text -> Streaming Audio

- `OpusToPcmTransform`: Spawns an `ffmpeg` process to transcode the incoming browser-native WebM/Opus stream into the raw PCM (16kHz, 16-bit, Mono) required for VAD and downstream processing (see the sketch after this list).
- `VADBufferTransform`: Utilizes `@ericedouard/vad-node-realtime` (running the Silero VAD model via ONNX) to analyze the PCM stream. It acts as a gate, buffering audio frames and only emitting a consolidated buffer when a "speech end" event is triggered.
- `AgentTransform`: Wraps a `createAgent` / LangGraph runnable. It takes a string input (the transcription), runs the agent graph, and streams the resulting `AIMessageChunk` objects.
- `ElevenLabsTTSTransform`: Manages a WebSocket connection to ElevenLabs. It sends text sentences as they become available and yields the returned PCM audio buffers.
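To illustrate how a process-backed stage such as `OpusToPcmTransform` can plug into the pipeline, here is a hedged sketch (not the project's actual implementation): `pipeThrough()` accepts any object exposing a `{ readable, writable }` pair, so a child `ffmpeg` process can be bridged to Web Streams with Node's built-in converters (available in recent Node versions):

```ts
import { spawn } from "node:child_process";
import { Readable, Writable } from "node:stream";

// Hypothetical sketch of an ffmpeg-backed transform stage (not the repo's
// actual code). Any object with { readable, writable } works with pipeThrough().
export class OpusToPcmTransformSketch {
  readonly readable: ReadableStream<Uint8Array>;
  readonly writable: WritableStream<Uint8Array>;

  constructor() {
    // Transcode WebM/Opus from stdin into raw 16 kHz, 16-bit, mono PCM on stdout.
    const ffmpeg = spawn("ffmpeg", [
      "-i", "pipe:0", // read WebM/Opus from stdin
      "-f", "s16le",  // raw signed 16-bit little-endian PCM
      "-ar", "16000", // 16 kHz sample rate
      "-ac", "1",     // mono
      "pipe:1",       // write to stdout
    ]);

    // Bridge the child process's Node streams to Web Streams.
    this.writable = Writable.toWeb(ffmpeg.stdin);
    this.readable = Readable.toWeb(ffmpeg.stdout) as ReadableStream<Uint8Array>;
  }
}
```

The same pattern applies to stages backed by sockets or SDKs: as long as a stage exposes `readable` and `writable` ends, it composes with the rest of the pipeline.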
Prerequisites (TypeScript implementation):

- ffmpeg (system installed, required for both implementations)
- Node.js (v18+)
- pnpm (or npm)
- API Keys:
  - `OPENAI_API_KEY`: For Whisper STT
  - `ELEVENLABS_API_KEY` & `ELEVENLABS_VOICE_ID`: For Text-to-Speech
  - `GOOGLE_API_KEY`: For the Gemini model driving the LangGraph agent
Prerequisites (Python implementation):

- Python (3.11+)
- uv (Python package manager)
- API Keys:
  - `ANTHROPIC_API_KEY`: For the Claude model driving the LangGraph agent
  - `ASSEMBLYAI_API_KEY`: For Speech-to-Text
  - `ELEVENLABS_API_KEY` & `ELEVENLABS_VOICE_ID`: For Text-to-Speech
You can run either the TypeScript or Python implementation. Both serve the same web interface.
TypeScript implementation:

- Install Dependencies:
  cd components/typescript
  npm install
- Environment Configuration: Create `components/typescript/.env`:
  OPENAI_API_KEY=sk-...
  ELEVENLABS_API_KEY=...
  ELEVENLABS_VOICE_ID=...
  GOOGLE_API_KEY=...
- Start Server:
  npm run server
  The app will be available at http://localhost:3000.
Python implementation:

- Install Dependencies:
  cd components/python
  uv sync --dev
- Environment Configuration: Create `components/python/.env`:
  ANTHROPIC_API_KEY=...
  ASSEMBLYAI_API_KEY=...
  ELEVENLABS_API_KEY=...
  ELEVENLABS_VOICE_ID=...
- Start Server:
  uv run src/main.py
  The app will be available at http://localhost:8000.
Alternatively, you can use the provided Makefile:
# Install both implementations
make bootstrap
# Run TypeScript server
make start-ts
# Run Python server
make start-py