A production-leaning voice assistant that runs 100% on your machine by default. Speak and hear a natural-sounding reply spoken back, targeting sub-second first-audio-byte latency on a mid-range laptop.
Pipeline: 🎙️ Mic → Faster-Whisper (STT) → Ollama / Groq (LLM) → Kokoro / Piper (TTS) → 🔊 Speakers — streamed over a single WebSocket.
- Why this exists
- Pros and cons (read before adopting)
- Status
- Quickstart
- Architecture at a glance
- Configuration
- Benchmarks & eval harness
- Observability
- Fine-tuning
- Project layout
- API surface
- Roadmap
- Documentation
- Security
- Contributing
- License
Every major voice assistant today is cloud-only, closed, and sends your microphone audio to someone else's datacenter. This project is the opposite: a streaming, local-first pipeline you can audit line by line, run disconnected from the internet, and swap components out of.
Design principles:
- Local by default, cloud when useful. Ollama is the default LLM; Groq is a swap for benchmarking / low-RAM hardware.
- Measure before you optimize. Every phase ships alongside an eval runner and a design doc. No "feels faster" claims — see ADR 0001.
- Small, reversible commits. Each phase is independently revertible. No big-bang migrations.
- Honest caveats over marketing. The "cons" section below is longer than the "pros" — on purpose.
| What | Why it matters |
|---|---|
| 100% offline capable | No API keys, no cloud, no data egress. LLM_PROVIDER=ollama + local STT/TTS. |
| Free and open-source end-to-end | Qwen2.5 (Apache-2.0), Whisper (MIT), Kokoro (Apache-2.0), Piper (MIT), Silero VAD (MIT). No CPML/CC-BY-NC model in the default path. |
| Streaming pipeline | LLM tokens and TTS sentences overlap; first audio byte lands before the full reply has generated. |
| Swappable providers | LLM_PROVIDER (ollama \| groq), TTS_PROVIDER (kokoro \| piper \| openvoice). Adding a new one is a single class. |
| Eval harness built in | Every claim is measurable: STT WER, LLM keyword-accuracy, TTS RTF, E2E first-audio-byte p50/p95. |
| Hardened for exposure | Opt-in API key, in-memory rate limiter, JSON logging, CORS allowlist. Not just localhost-grade. |
| Multilingual TTS path | Piper voices for Hindi, Tamil, Telugu, Bengali, Marathi, and 30+ others. |
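The streaming-pipeline row is the core latency trick: LLM tokens are flushed into sentences as they arrive, so TTS can begin on the first sentence instead of waiting for the full reply. A toy generator sketch of that overlap (stand-in data, not the project's actual code):

```python
def llm_tokens():
    # Toy LLM: yields tokens one at a time, as a streaming provider would.
    for tok in ["Paris ", "is ", "the ", "capital. ", "It ", "is ", "lovely."]:
        yield tok

def sentences(tokens):
    # Accumulate tokens; flush on sentence-ending punctuation.
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "!", "?")):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()

def tts_chunks(sents):
    # Toy TTS: one "audio chunk" per sentence. The first chunk exists
    # after the first sentence, not after the whole reply.
    for s in sents:
        yield f"<audio:{s}>"

first_chunk = next(tts_chunks(sentences(llm_tokens())))
print(first_chunk)  # → <audio:Paris is the capital.>
```

Because every stage is a generator, nothing downstream blocks on the full upstream output; that is exactly why first-audio-byte lands early.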
| What | Impact |
|---|---|
| Local LLM quality < frontier APIs | Qwen2.5-3B is good but not GPT-4. For deep reasoning, switch LLM_PROVIDER=groq or run qwen2.5:7b if you have the RAM. |
| Sub-500 ms is aspirational, not guaranteed | Target is <1 s first-audio-byte on CPU + 3B model. <500 ms requires a GPU and careful tuning. See Benchmarks. |
| Browser AEC is the only echo canceller | Open-speaker barge-in works because Chrome's echoCancellation runs on the mic. Headphones are more reliable. No server-side AEC yet. |
| No multi-user concurrency tuning | Single-user design. Model handles are shared; a second concurrent turn serializes on the TTS thread. |
| Voice cloning is scaffolded, not validated | Phase D' (OpenVoice v2) ships the provider + consent gate, but the model setup + watermarking are a manual one-time step. See phase-d-notes. |
| Raspberry Pi build untested here | Phase E has the ARM Dockerfile + perf targets but needs a physical Pi to validate. |
| Sentence splitter is naive | "Dr. Smith" may split early. Acceptable in practice; fix is tracked. |
| Rate limit is in-memory, single-process | Swap for Redis if you deploy multiple replicas. |
| WebSocket auth not wired by default | require_api_key_ws exists in app/core/auth.py but /ws/voice doesn't call it yet. One-line change when you deploy. |
| faster-whisper on CPU is the latency floor | Real streaming STT (whisper-streaming, Moonshine) would cut 200–500 ms but is deferred. |
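The "naive splitter" caveat above is easy to picture. A minimal incremental splitter with an abbreviation guard shows both the failure mode and the shape of the tracked fix (illustrative only, not the project's IncrementalSentenceSplitter):

```python
import re

# A guard list like this is the usual fix for early splits on titles.
ABBREVS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e."}

def split_incremental(buf: str):
    """Return (completed sentences, unfinished remainder) from a token buffer."""
    out, start = [], 0
    for m in re.finditer(r"[.!?]\s+", buf):
        candidate = buf[start:m.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVS:  # "Dr." is not the end of a sentence
            continue
        out.append(candidate)
        start = m.end()
    return out, buf[start:]

done, rest = split_incremental("Dr. Smith arrived. He said hi. And then")
print(done)  # → ['Dr. Smith arrived.', 'He said hi.']
print(rest)  # → 'And then'
```

Without the ABBREVS check, "Dr." would be flushed to TTS as its own sentence, which is the early-split bug described above.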
| Phase | Scope | State | Notes |
|---|---|---|---|
| 0 | Eval harness + baseline metrics | ✅ shipped | docs/adr/0001 |
| A | Ollama provider (local LLM) | ✅ shipped | docs/adr/0002 |
| B.1 | Server-side VAD endpointing | ✅ shipped | notes |
| B.2 | Sentence-level TTS streaming | ✅ shipped | notes |
| B.3 | Streaming LLM tokens → sentences | ✅ shipped | notes |
| B.4 | Continuous mic + server VAD | | notes |
| B.5 | Barge-in + turn cancellation | ✅ click-during-speaking | notes |
| 0.5 | Audio hygiene utilities | ✅ shipped | notes |
| C | TTS provider abstraction + Piper | ✅ shipped (voice download is manual) | notes |
| D' | OpenVoice v2 cloning scaffold | | notes |
| E | ARM64 edge build | | notes |
| F | Hardening essentials | ✅ auth, rate limit, CORS, JSON logs, CI | notes |
| G.1 | LLM-as-judge eval | ✅ shipped | notes |
| G.2 | OpenTelemetry + Jaeger | ✅ shipped | notes |
| G.3 | LoRA fine-tune Qwen2.5 | ✅ code + notebook; runs on Colab T4 | notes |
Legend: ✅ = runs + tested.
git clone https://github.com/HemantBK/AI-Voice-Assistant.git
cd AI-Voice-Assistant
cp backend/.env.example backend/.env # defaults to LLM_PROVIDER=ollama
docker compose up

First boot downloads qwen2.5:3b into the ollama_data volume (~2 GB). Open http://localhost:5173.
Prerequisites: Python 3.11+, Node 20+, Ollama, ffmpeg.
# Terminal 1 — LLM
ollama pull qwen2.5:3b
# Terminal 2 — backend
cd backend
python -m venv .venv && . .venv/Scripts/activate # Linux/macOS: source .venv/bin/activate
pip install -e ".[llm,audio,observability]" # or: pip install -r requirements.txt
cp .env.example .env
python run.py # → http://localhost:8000
# Terminal 3 — frontend
cd frontend
npm ci
npm run dev   # → http://localhost:5173

To swap in Groq instead:

# backend/.env
LLM_PROVIDER=groq
GROQ_API_KEY=<your-free-key-from-console.groq.com>

No Ollama needed. STT + TTS still run locally.
flowchart LR
subgraph Browser["🌐 Browser"]
Mic["🎙️ getUserMedia<br/>+ AudioWorklet<br/>(48k → 16k PCM16)"]
Player["🔊 StreamingAudioPlayer<br/>(gapless, seq-ordered)"]
WS["VoiceWsClient"]
Mic --> WS
WS --> Player
end
subgraph Backend["⚙️ FastAPI + uvicorn"]
MW["Middleware chain<br/>CORS → APIKey<br/>→ RateLimit → Timing"]
TM["TurnManager<br/>(1 in-flight turn<br/>cancel on barge-in)"]
VAD["FrameVad<br/>(Silero)"]
STT["stt_service<br/>(Faster-Whisper)"]
Split["IncrementalSentence<br/>Splitter"]
LLM["llm_service<br/>.chat_stream"]
TTS["tts_service"]
MW --> TM
TM --> VAD
VAD --> STT
STT --> LLM
LLM -.tokens.-> Split
Split -.sentence.-> TTS
end
subgraph Models["Providers"]
Ollama[("Ollama<br/>local")]
Groq[("Groq<br/>cloud opt-in")]
Kokoro[("Kokoro<br/>en/ja/zh")]
Piper[("Piper<br/>multilingual")]
end
WS <-->|"WebSocket<br/>JSON frames"| MW
LLM --> Ollama
LLM --> Groq
TTS --> Kokoro
TTS --> Piper
classDef ext fill:#eef,stroke:#88a,stroke-width:1px
class Ollama,Groq,Kokoro,Piper ext
Deep dive, tradeoffs, and failure modes: ARCHITECTURE.md.
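For intuition about the browser-side 48 kHz → 16 kHz PCM16 conversion in the diagram, here is a naive decimate-by-3 sketch. Real code should low-pass filter before decimating to avoid aliasing; the per-group averaging here is only a crude stand-in, and this is not the worklet's implementation:

```python
import struct

def downsample_48k_to_16k(pcm16: bytes) -> bytes:
    """Naive 3:1 decimation: average each group of 3 samples into 1."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    out = [
        sum(samples[i:i + 3]) // 3
        for i in range(0, len(samples) - len(samples) % 3, 3)
    ]
    return struct.pack(f"<{len(out)}h", *out)

frame = struct.pack("<6h", 300, 300, 300, -90, -90, -90)  # 6 samples @ 48 kHz
print(struct.unpack("<2h", downsample_48k_to_16k(frame)))  # → (300, -90)
```

The ratio is exact (48000 / 16000 = 3), which is why a fixed group size works at all; resampling between non-integer-ratio rates needs interpolation.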
Everything is env-driven. Full reference in backend/.env.example. The ones you actually change:
| Variable | Default | What it does |
|---|---|---|
| LLM_PROVIDER | ollama | ollama (local) \| groq (cloud) |
| OLLAMA_MODEL | qwen2.5:3b | Try qwen2.5:7b on ≥16 GB RAM |
| OLLAMA_HOST | http://localhost:11434 | Point at a remote Ollama if desired |
| GROQ_API_KEY | (empty) | Required when LLM_PROVIDER=groq |
| TTS_PROVIDER | kokoro | kokoro \| piper \| openvoice |
| PIPER_VOICE | (empty) | e.g. hi_IN-pratham-medium |
| WHISPER_MODEL_SIZE | base | tiny / base / small / medium / large-v3 |
| WHISPER_VAD_FILTER | true | Silero VAD trim inside Whisper (B.1) |
| API_KEY | (empty) | Set this when deploying publicly |
| ALLOWED_ORIGINS | * | Comma-separated list in prod |
| RATE_LIMIT_PER_MINUTE | 60 | 0 disables |
| LOG_FORMAT | text | json for structured logs |
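All of those knobs resolve the same way: the env var wins if set, else the default applies, with light coercion for ints and comma-separated lists. A sketch of that pattern (illustrative; the project's actual config.py may differ):

```python
import os

def env_str(name: str, default: str) -> str:
    """String knob: env var if set, else default."""
    return os.environ.get(name, default)

def env_int(name: str, default: int) -> int:
    """Integer knob, e.g. RATE_LIMIT_PER_MINUTE (0 disables)."""
    return int(os.environ.get(name, default))

def env_list(name: str, default: str = "*") -> list[str]:
    """Comma-separated knob, e.g. ALLOWED_ORIGINS in prod."""
    return [o.strip() for o in os.environ.get(name, default).split(",")]

os.environ["RATE_LIMIT_PER_MINUTE"] = "0"  # as if set in backend/.env
os.environ["ALLOWED_ORIGINS"] = "https://a.example,https://b.example"
print(env_int("RATE_LIMIT_PER_MINUTE", 60))  # → 0
print(env_list("ALLOWED_ORIGINS"))  # → ['https://a.example', 'https://b.example']
```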
Dependencies live in backend/pyproject.toml with named extras so each scenario has a one-line install:
| Scenario | Command |
|---|---|
| Everything (local dev) | pip install -e "backend[all]" |
| Runtime only (Docker-equivalent) | pip install -e "backend[llm,audio,observability]" or pip install -r backend/requirements.txt |
| Tests only (CI) | pip install -e "backend[dev,observability]" |
| Tracing off, no test tooling | pip install -e "backend[llm,audio]" |
Extras defined: audio (Whisper + Kokoro + Silero), llm (Groq + Ollama SDKs), observability (OpenTelemetry), dev (pytest + ruff), all (union).
This project takes measurement seriously. Every phase has a reproducible eval.
# Build the baseline (no server needed for LLM + TTS)
python -m eval.runners.baseline
# End-to-end streaming latency (server must be running)
python -m eval.runners.eval_tts --emit-stt-fixtures
python -m eval.runners.eval_streaming \
--url 'ws://127.0.0.1:8000/ws/voice?stream=1' \
--fixture eval/datasets/stt/tts-roundtrip-00.wav \
--runs 10 --save

Metrics reported:
| Metric | What it captures |
|---|---|
| first_llm_delta_ms | time to first LLM token (isolates LLM latency) |
| first_audio_byte_ms | time to first TTS chunk (what the user feels) |
| tts_end_ms | total turn duration |
| mean_wer | Whisper accuracy on your STT manifest |
| mean_rtf | TTS synth-time / audio-time |
| vad_trimmed_ms | silence removed by Silero VAD |
| judge mean (C/R/Cn) | LLM-as-judge score on correctness / relevance / conciseness (1–5) |
| within_1 agreement | two-judge inter-rater agreement (cross-check for bias) |
Results land in eval/results/*.json — gitignored, machine-specific. Diff across runs to prove deltas.
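mean_wer is standard word error rate: word-level edit distance divided by reference length. The harness may well use a library for this, but the definition fits in a few lines:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (subs + inserts + deletes) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic Levenshtein DP over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the lights", "turn off the light"))  # → 0.5 (2 subs / 4 words)
```

Note WER can exceed 1.0 when the hypothesis has many insertions, which is why a per-manifest mean is reported rather than a single ratio.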
Keyword-hit scoring ("does the word 'Paris' appear in the answer?") is cheap but shallow. For a credible quality signal we score each reply with a stronger LLM against a published rubric: correctness, relevance, conciseness (each 1–5). An optional second judge runs in parallel so we can report inter-rater agreement — if two independent models don't agree, the metric itself is noise.
# Local, free, no API key — uses qwen2.5:7b via Ollama
python -m eval.runners.eval_llm --judge ollama --save
# Cross-check with a Groq-hosted judge from a different model family
python -m eval.runners.eval_llm \
--judge ollama:qwen2.5:7b \
--judge2 groq:llama-3.3-70b-versatile \
--save

Judge failures (bad JSON, timeout) are recorded per-item and don't fail the run. Rubric + known biases: eval/datasets/llm/rubric.md. Design: docs/design/phase-g1-llm-judge.md.
Design rationale for the harness itself: ADR 0001.
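The within_1 agreement number itself is a one-liner: the fraction of items where the two judges land within one point of each other. A sketch with made-up scores:

```python
def within_1_agreement(scores_a: list[int], scores_b: list[int]) -> float:
    """Fraction of items where two judges' scores differ by at most 1."""
    pairs = list(zip(scores_a, scores_b))
    return sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

judge1 = [5, 4, 3, 2, 5]  # e.g. correctness from the Ollama judge
judge2 = [4, 4, 5, 2, 5]  # e.g. from the Groq-hosted judge
print(within_1_agreement(judge1, judge2))  # → 0.8
```

A low value here is a signal about the metric, not the model: if independent judges disagree, their scores should not drive decisions.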
.
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI app + middleware chain
│ │ ├── config.py # All env-driven knobs
│ │ ├── core/
│ │ │ ├── auth.py # APIKeyMiddleware + WS gate
│ │ │ ├── rate_limit.py # Token-bucket per-key/IP
│ │ │ ├── logging.py # JSON / text formatter
│ │ │ └── timing.py # Per-stage X-Stage-*-Ms headers
│ │ ├── routers/
│ │ │ └── pipeline.py # /api/pipeline + /ws/voice
│ │ ├── services/
│ │ │ ├── stt_service.py # Faster-Whisper wrapper
│ │ │ ├── llm_service.py # Thin router to llm/ providers
│ │ │ ├── llm/ # Ollama | Groq providers
│ │ │ ├── tts_service.py # Thin router to tts/ providers
│ │ │ └── tts/ # Kokoro | Piper | OpenVoice
│ │ ├── streaming/
│ │ │ ├── async_stream.py # Sync→async generator bridge
│ │ │ ├── sentence_splitter.py
│ │ │ ├── turn_manager.py # Single in-flight turn + cancel
│ │ │ ├── vad.py # Silero frame-VAD
│ │ │ └── wav.py # PCM16 → WAV writer
│ │ └── audio/resample.py # Hygiene utilities
│ ├── tests/ # pytest unit tests
│ ├── requirements.txt
│ ├── pyproject.toml # pytest + ruff config
│ └── Dockerfile[.arm64]
├── frontend/
│ ├── public/recorder-worklet.js
│ ├── src/
│ │ ├── audio/ # streamingPlayer, microphoneStream, clientVad
│ │ ├── services/ # voiceWsClient, consent
│ │ └── components/ # VoiceAssistant, ConsentGate
│ └── package.json
├── eval/
│ ├── lib/{metrics,reporter}.py
│ ├── runners/eval_{llm,stt,tts,latency,streaming,baseline}.py
│ └── datasets/ # Golden Q&A + STT/TTS prompts
├── docs/
│ ├── adr/ # Architecture Decision Records
│ ├── design/ # Phase-by-phase notes (B.1–F)
│ └── design-doc-template.md
├── docker-compose.yml # Ollama + backend + frontend
└── .github/workflows/ci.yml # Lint + test matrix
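As an aside on app/core/rate_limit.py: the "token-bucket per-key/IP" it names has a well-known shape, sketched here as a stand-alone illustration (injected clock for determinism; not the project's code):

```python
import time

class TokenBucket:
    """Allow up to `rate_per_minute` requests, refilled continuously."""

    def __init__(self, rate_per_minute: int, clock=time.monotonic):
        self.capacity = rate_per_minute
        self.tokens = float(rate_per_minute)
        self.refill_per_sec = rate_per_minute / 60.0
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Deterministic demo with a fake clock:
t = [0.0]
bucket = TokenBucket(rate_per_minute=2, clock=lambda: t[0])
print([bucket.allow() for _ in range(3)])  # → [True, True, False]
t[0] = 30.0                                # 30 s later: one token refilled
print(bucket.allow())                      # → True
```

Because the state lives in one process's memory, a second replica gets its own bucket, which is exactly the Redis caveat in the cons table.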
- POST /api/pipeline — one-shot: audio in, {transcript, response, audio_b64} out. Good for eval and simple clients.
- POST /api/stt/transcribe, POST /api/chat/, POST /api/tts/synthesize — individual stages.
- GET /health, GET /ready — liveness / readiness.
Query params on WS /ws/voice:
- ?stream=1 — stream LLM tokens + sentence-level TTS (B.2 + B.3).
- ?continuous=1 — accept audio_frame messages; server VAD detects end-of-turn (B.4).
Full protocol, including barge_in, cancelled, and llm_delta, documented in docs/design/phase-b-streaming-pipeline.md.
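For a feel of the client side, a hedged sketch of a frame dispatcher. The type names llm_delta, barge_in, and cancelled come from the protocol doc; the payload field names (text, etc.) are assumptions here, not the documented schema:

```python
import json

def handle_frame(raw: str, state: dict) -> None:
    """Dispatch one JSON frame from /ws/voice (payload fields assumed)."""
    frame = json.loads(raw)
    kind = frame.get("type")
    if kind == "llm_delta":
        # Streamed LLM tokens accumulate into the visible reply.
        state["reply"] = state.get("reply", "") + frame.get("text", "")
    elif kind == "barge_in":
        state["speaking"] = False  # user interrupted; stop playback
    elif kind == "cancelled":
        state.clear()              # turn cancelled server-side; reset

state: dict = {}
handle_frame('{"type": "llm_delta", "text": "Hello"}', state)
handle_frame('{"type": "llm_delta", "text": ", world"}', state)
print(state["reply"])  # → Hello, world
```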
See CHANGELOG.md for what landed in each phase and docs/design/ for the design docs.
timeline
title Shipping history
Phase 0 : Eval harness + baseline metrics
Phase A : Ollama local LLM provider
Phase B.1 : Whisper VAD endpointing
Phase B.2 : Sentence-level TTS streaming
Phase B.3 : LLM token streaming
Phase B.4 : Continuous mic (plumbing)
Phase B.5 : Barge-in + cancellation
Phase 0.5 : Audio hygiene utilities
Phase C : Multilingual TTS (Piper)
Phase D' : Voice cloning scaffold
Phase E : ARM64 edge scaffold
Phase F : Hardening essentials
Next up (priority order):
- Validate B.2/B.3 in a real browser; capture baseline numbers.
- Wire B.4 continuous mic into the UI behind a hands-free toggle.
- Real-speech fixtures for STT eval (currently TTS-roundtrip only).
- Replace in-memory rate limiter with Redis when a second replica is needed.
- OpenTelemetry migration for the timing middleware (ADR 0001 follow-up).
Opt-in OpenTelemetry tracing (Phase G.2). Every voice turn becomes a waterfall of pipeline.stt → pipeline.llm_stream → pipeline.tts spans with attributes (language, audio bytes, token count, VAD trim). Exports over OTLP/HTTP, so any compliant backend works — Jaeger, Tempo, Grafana Cloud, Honeycomb, Datadog.
Zero-account local stack:
docker compose -f docker-compose.yml -f docker-compose.observability.yml up
# → Jaeger UI at http://localhost:16686
# → pick service "voice-assistant-backend"

Tracing is opt-in: OTEL_ENABLED defaults to false, so set OTEL_ENABLED=true to turn it on. The X-Stage-*-Ms response headers the eval runner depends on stay on regardless. Design + setup: docs/design/phase-g2-observability.md.
LoRA fine-tune Qwen2.5-3B-Instruct on your own chat data and serve the result through the existing Ollama provider with zero backend changes (Phase G.3). Paint-by-numbers Colab notebook + reproducible CLI scripts + Ollama Modelfile + A/B eval runner.
# 1. Train on Colab (free T4) using finetune/train.ipynb
# 2. Back on your machine, register the merged model:
cd finetune/out && ollama create voice-assistant-ft -f Modelfile
# 3. Flip backend/.env to use it:
# OLLAMA_MODEL=voice-assistant-ft
# 4. Prove the fine-tune helped (or didn't) with judge scores:
python -m eval.runners.eval_llm_compare \
--base qwen2.5:3b --finetuned voice-assistant-ft --judge ollama --save

Full guide with dataset curation tips + honest caveats: finetune/README.md. Design rationale: docs/design/phase-g3-lora-finetune.md.
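Step 4's comparison boils down to pairwise judge-score deltas on the same prompts. A sketch with made-up scores (positive mean = the fine-tune won; not the eval_llm_compare implementation):

```python
from statistics import mean

def ab_delta(base_scores: list[int], ft_scores: list[int]):
    """Per-item deltas plus mean improvement over identical prompts."""
    deltas = [ft - b for b, ft in zip(base_scores, ft_scores)]
    return deltas, mean(deltas)

base = [3, 4, 2, 4, 3]  # e.g. judge correctness for qwen2.5:3b
ft   = [4, 4, 3, 4, 4]  # e.g. for voice-assistant-ft
deltas, avg = ab_delta(base, ft)
print(deltas)  # → [1, 0, 1, 0, 1]
print(avg)     # → 0.6
```

Pairing on the same prompts matters: a raw mean over different prompt sets would confound model quality with prompt difficulty.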
| Doc | When to read it |
|---|---|
| README.md | This page — overview + quickstart |
| ARCHITECTURE.md | Deep dive, dataflow, failure modes, tradeoffs |
| CHANGELOG.md | What shipped in each phase |
| CONTRIBUTING.md | Dev setup, style, PR process |
| SECURITY.md | Threat model + disclosure policy |
| docs/adr/ | Immutable architectural decisions |
| docs/design/ | Per-phase ship notes (B.1 → F) |
| docs/design-doc-template.md | Template for your next phase |
| eval/README.md | How to run + read benchmarks |
- API key gate, rate limiter, and CORS allowlist are opt-in (set API_KEY, ALLOWED_ORIGINS).
- Mic audio never leaves the box in local mode (LLM_PROVIDER=ollama). Verify with the Network tab.
- No audio retention: the server holds the current turn's PCM in memory and drops it.
- The voice cloning path is gated by a consent modal; see phase-d-notes.
- Report vulnerabilities via the process in SECURITY.md. Do not file public issues for security bugs.
PRs welcome. Rules:
- Open an issue first for anything non-trivial.
- New phases or protocol changes require a design doc in docs/design/ before code.
- Every change that claims a perf or quality delta must include eval-harness numbers in the PR description.
- Tests pass, lint clean: pytest backend/tests + npx eslint frontend/src.
- Conventional commits (feat:, fix:, docs:, refactor:, test:).
Full guide: CONTRIBUTING.md.
MIT — see LICENSE. Third-party components retain their own licenses; see the credits block in ARCHITECTURE.md.
Built on top of: Ollama, faster-whisper / OpenAI Whisper, Kokoro, Piper, Silero VAD, Qwen, FastAPI, Vite, React. Thanks to every maintainer above.