A production-leaning voice assistant that runs 100% on your machine by default. Speak and hear a natural-sounding reply spoken back, targeting sub-second first-audio-byte latency on a mid-range laptop.
Pipeline: 🎙️ Mic → Faster-Whisper (STT) → Ollama / Groq (LLM) → Kokoro / Piper (TTS) → 🔊 Speakers — streamed over a single WebSocket.
- Why this exists
- Pros and cons (read before adopting)
- Status
- Quickstart
- Architecture at a glance
- Configuration
- Benchmarks & eval harness
- Observability
- Fine-tuning
- Project layout
- API surface
- Roadmap
- Documentation
- Security
- Contributing
- License
Every major voice assistant today is cloud-only, closed, and sends your microphone audio to someone else's datacenter. This project is the opposite: a streaming, local-first pipeline you can audit line by line, run disconnected from the internet, and swap components out of.
Design principles:
- Local by default, cloud when useful. Ollama is the default LLM; Groq is a swap for benchmarking / low-RAM hardware.
- Measure before you optimize. Every phase ships alongside an eval runner and a design doc. No "feels faster" claims — see ADR 0001.
- Small, reversible commits. Each phase is independently revertible. No big-bang migrations.
- Honest caveats over marketing. The "cons" section below is longer than the "pros" — on purpose.
| What | Why it matters |
|---|---|
| 100% offline capable | No API keys, no cloud, no data egress. LLM_PROVIDER=ollama + local STT/TTS. |
| Free and open-source end-to-end | Qwen2.5 (Apache-2.0), Whisper (MIT), Kokoro (Apache-2.0), Piper (MIT), Silero VAD (MIT). No CPML/CC-BY-NC model in the default path. |
| Streaming pipeline | LLM tokens and TTS sentences overlap; first audio byte lands before the full reply has generated. |
| Swappable providers | LLM_PROVIDER (ollama \| groq), TTS_PROVIDER (kokoro \| piper \| openvoice). Adding a new one is a single class. |
| Eval harness built in | Every claim is measurable: STT WER, LLM keyword-accuracy, TTS RTF, E2E first-audio-byte p50/p95. |
| Hardened for exposure | Opt-in API key, in-memory rate limiter, JSON logging, CORS allowlist. Not just localhost-grade. |
| Multilingual TTS path | Piper voices for Hindi, Tamil, Telugu, Bengali, Marathi, and 30+ others. |
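The streaming-pipeline row is the core latency trick: LLM tokens are flushed into sentences as they arrive, so TTS can begin on the first sentence instead of waiting for the full reply. A toy generator sketch of that overlap (stand-in data, not the project's actual code):

```python
def llm_tokens():
    # Toy LLM: yields tokens one at a time, as a streaming provider would.
    for tok in ["Paris ", "is ", "the ", "capital. ", "It ", "is ", "lovely."]:
        yield tok

def sentences(tokens):
    # Accumulate tokens; flush on sentence-ending punctuation.
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "!", "?")):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()

def tts_chunks(sents):
    # Toy TTS: one "audio chunk" per sentence. The first chunk exists
    # after the first sentence, not after the whole reply.
    for s in sents:
        yield f"<audio:{s}>"

first_chunk = next(tts_chunks(sentences(llm_tokens())))
print(first_chunk)  # → <audio:Paris is the capital.>
```

Because every stage is a generator, nothing downstream blocks on the full upstream output; that is exactly why first-audio-byte lands early.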
| What | Impact |
|---|---|
| Local LLM quality < frontier APIs | Qwen2.5-3B is good but not GPT-4. For deep reasoning, switch LLM_PROVIDER=groq or run qwen2.5:7b if you have the RAM. |
| Sub-500 ms is aspirational, not guaranteed | Target is <1 s first-audio-byte on CPU + 3B model. <500 ms requires a GPU and careful tuning. See Benchmarks. |
| Browser AEC is the only echo canceller | Open-speaker barge-in works because Chrome's echoCancellation runs on the mic. Headphones are more reliable. No server-side AEC yet. |
| No multi-user concurrency tuning | Single-user design. Model handles are shared; a second concurrent turn serializes on the TTS thread. |
| Voice cloning is scaffolded, not validated | Phase D' (OpenVoice v2) ships the provider + consent gate, but the model setup + watermarking are a manual one-time step. See phase-d-notes. |
| Raspberry Pi build untested here | Phase E has the ARM Dockerfile + perf targets but needs a physical Pi to validate. |
| Sentence splitter is naive | "Dr. Smith" may split early. Acceptable in practice; fix is tracked. |
| Rate limit is in-memory, single-process | Swap for Redis if you deploy multiple replicas. |
| WebSocket auth not wired by default | require_api_key_ws exists in app/core/auth.py but /ws/voice doesn't call it yet. One-line change when you deploy. |
| faster-whisper on CPU is the latency floor | Real streaming STT (whisper-streaming, Moonshine) would cut 200–500 ms but is deferred. |
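The "naive splitter" caveat above is easy to picture. A minimal incremental splitter with an abbreviation guard shows both the failure mode and the shape of the tracked fix (illustrative only, not the project's IncrementalSentenceSplitter):

```python
import re

# A guard list like this is the usual fix for early splits on titles.
ABBREVS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e."}

def split_incremental(buf: str):
    """Return (completed sentences, unfinished remainder) from a token buffer."""
    out, start = [], 0
    for m in re.finditer(r"[.!?]\s+", buf):
        candidate = buf[start:m.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVS:  # "Dr." is not the end of a sentence
            continue
        out.append(candidate)
        start = m.end()
    return out, buf[start:]

done, rest = split_incremental("Dr. Smith arrived. He said hi. And then")
print(done)  # → ['Dr. Smith arrived.', 'He said hi.']
print(rest)  # → 'And then'
```

Without the ABBREVS check, "Dr." would be flushed to TTS as its own sentence, which is the early-split bug described above.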
| Phase | Scope | State | Notes |
|---|---|---|---|
| 0 | Eval harness + baseline metrics | ✅ shipped | docs/adr/0001 |
| A | Ollama provider (local LLM) | ✅ shipped | docs/adr/0002 |
| B.1 | Server-side VAD endpointing | ✅ shipped | notes |
| B.2 | Sentence-level TTS streaming | ✅ shipped | notes |
| B.3 | Streaming LLM tokens → sentences | ✅ shipped | notes |
| B.4 | Continuous mic + server VAD | | notes |
| B.5 | Barge-in + turn cancellation | ✅ click-during-speaking | notes |
| 0.5 | Audio hygiene utilities | ✅ shipped | notes |
| C | TTS provider abstraction + Piper | ✅ shipped (voice download is manual) | notes |
| D' | OpenVoice v2 cloning scaffold | | notes |
| E | ARM64 edge build | | notes |
| F | Hardening essentials | ✅ auth, rate limit, CORS, JSON logs, CI | notes |
| G.1 | LLM-as-judge eval | ✅ shipped | notes |
| G.2 | OpenTelemetry + Jaeger | ✅ shipped | notes |
| G.3 | LoRA fine-tune Qwen2.5 | ✅ code + notebook; runs on Colab T4 | notes |
Legend: ✅ = runs + tested.
git clone https://github.com/HemantBK/AI-Voice-Assistant.git
cd AI-Voice-Assistant
cp backend/.env.example backend/.env # defaults to LLM_PROVIDER=ollama
docker compose up

First boot downloads qwen2.5:3b into the ollama_data volume (~2 GB). Open http://localhost:5173.
Prerequisites: Python 3.11+, Node 20+, Ollama, ffmpeg.
# Terminal 1 — LLM
ollama pull qwen2.5:3b
# Terminal 2 — backend
cd backend
python -m venv .venv && . .venv/Scripts/activate # Linux/macOS: source .venv/bin/activate
pip install -e ".[llm,audio,observability]" # or: pip install -r requirements.txt
cp .env.example .env
python run.py # → http://localhost:8000
# Terminal 3 — frontend
cd frontend
npm ci
npm run dev   # → http://localhost:5173

To swap in Groq instead:

# backend/.env
LLM_PROVIDER=groq
GROQ_API_KEY=<your-free-key-from-console.groq.com>

No Ollama needed. STT + TTS still run locally.
flowchart LR
subgraph Browser["🌐 Browser"]
Mic["🎙️ getUserMedia<br/>+ AudioWorklet<br/>(48k → 16k PCM16)"]
Player["🔊 StreamingAudioPlayer<br/>(gapless, seq-ordered)"]
WS["VoiceWsClient"]
Mic --> WS
WS --> Player
end
subgraph Backend["⚙️ FastAPI + uvicorn"]
MW["Middleware chain<br/>CORS → APIKey<br/>→ RateLimit → Timing"]
TM["TurnManager<br/>(1 in-flight turn<br/>cancel on barge-in)"]
VAD["FrameVad<br/>(Silero)"]
STT["stt_service<br/>(Faster-Whisper)"]
Split["IncrementalSentence<br/>Splitter"]
LLM["llm_service<br/>.chat_stream"]
TTS["tts_service"]
MW --> TM
TM --> VAD
VAD --> STT
STT --> LLM
LLM -.tokens.-> Split
Split -.sentence.-> TTS
end
subgraph Models["Providers"]
Ollama[("Ollama<br/>local")]
Groq[("Groq<br/>cloud opt-in")]
Kokoro[("Kokoro<br/>en/ja/zh")]
Piper[("Piper<br/>multilingual")]
end
WS <-->|"WebSocket<br/>JSON frames"| MW
LLM --> Ollama
LLM --> Groq
TTS --> Kokoro
TTS --> Piper
classDef ext fill:#eef,stroke:#88a,stroke-width:1px
class Ollama,Groq,Kokoro,Piper ext
Deep dive, tradeoffs, and failure modes: ARCHITECTURE.md.
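For intuition about the browser-side 48 kHz → 16 kHz PCM16 conversion in the diagram, here is a naive decimate-by-3 sketch. Real code should low-pass filter before decimating to avoid aliasing; the per-group averaging here is only a crude stand-in, and this is not the worklet's implementation:

```python
import struct

def downsample_48k_to_16k(pcm16: bytes) -> bytes:
    """Naive 3:1 decimation: average each group of 3 samples into 1."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    out = [
        sum(samples[i:i + 3]) // 3
        for i in range(0, len(samples) - len(samples) % 3, 3)
    ]
    return struct.pack(f"<{len(out)}h", *out)

frame = struct.pack("<6h", 300, 300, 300, -90, -90, -90)  # 6 samples @ 48 kHz
print(struct.unpack("<2h", downsample_48k_to_16k(frame)))  # → (300, -90)
```

The ratio is exact (48000 / 16000 = 3), which is why a fixed group size works at all; resampling between non-integer-ratio rates needs interpolation.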
Everything is env-driven. Full reference in backend/.env.example. The ones you actually change:
| Variable | Default | What it does |
|---|---|---|
| LLM_PROVIDER | ollama | ollama (local) \| groq (cloud) |
| OLLAMA_MODEL | qwen2.5:3b | Try qwen2.5:7b on ≥16 GB RAM |
| OLLAMA_HOST | http://localhost:11434 | Point at a remote Ollama if desired |
| GROQ_API_KEY | (empty) | Required when LLM_PROVIDER=groq |
| TTS_PROVIDER | kokoro | kokoro \| piper \| openvoice |
| PIPER_VOICE | (empty) | e.g. hi_IN-pratham-medium |
| WHISPER_MODEL_SIZE | base | tiny / base / small / medium / large-v3 |
| WHISPER_VAD_FILTER | true | Silero VAD trim inside Whisper (B.1) |
| API_KEY | (empty) | Set this when deploying publicly |
| ALLOWED_ORIGINS | * | Comma-separated list in prod |
| RATE_LIMIT_PER_MINUTE | 60 | 0 disables |
| LOG_FORMAT | text | json for structured logs |
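All of those knobs resolve the same way: the env var wins if set, else the default applies, with light coercion for ints and comma-separated lists. A sketch of that pattern (illustrative; the project's actual config.py may differ):

```python
import os

def env_str(name: str, default: str) -> str:
    """String knob: env var if set, else default."""
    return os.environ.get(name, default)

def env_int(name: str, default: int) -> int:
    """Integer knob, e.g. RATE_LIMIT_PER_MINUTE (0 disables)."""
    return int(os.environ.get(name, default))

def env_list(name: str, default: str = "*") -> list[str]:
    """Comma-separated knob, e.g. ALLOWED_ORIGINS in prod."""
    return [o.strip() for o in os.environ.get(name, default).split(",")]

os.environ["RATE_LIMIT_PER_MINUTE"] = "0"  # as if set in backend/.env
os.environ["ALLOWED_ORIGINS"] = "https://a.example,https://b.example"
print(env_int("RATE_LIMIT_PER_MINUTE", 60))  # → 0
print(env_list("ALLOWED_ORIGINS"))  # → ['https://a.example', 'https://b.example']
```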
Dependencies live in backend/pyproject.toml with named extras so each scenario has a one-line install:
| Scenario | Command |
|---|---|
| Everything (local dev) | pip install -e "backend[all]" |
| Runtime only (Docker-equivalent) | pip install -e "backend[llm,audio,observability]" or pip install -r backend/requirements.txt |
| Tests only (CI) | pip install -e "backend[dev,observability]" |
| Tracing off, no test tooling | pip install -e "backend[llm,audio]" |
Extras defined: audio (Whisper + Kokoro + Silero), llm (Groq + Ollama SDKs), observability (OpenTelemetry), dev (pytest + ruff), all (union).
This project takes measurement seriously. Every phase has a reproducible eval.
# Build the baseline (no server needed for LLM + TTS)
python -m eval.runners.baseline
# End-to-end streaming latency (server must be running)
python -m eval.runners.eval_tts --emit-stt-fixtures
python -m eval.runners.eval_streaming \
--url 'ws://127.0.0.1:8000/ws/voice?stream=1' \
--fixture eval/datasets/stt/tts-roundtrip-00.wav \
--runs 10 --save

Metrics reported:
| Metric | What it captures |
|---|---|
| first_llm_delta_ms | time to first LLM token (isolates LLM latency) |
| first_audio_byte_ms | time to first TTS chunk (what the user feels) |
| tts_end_ms | total turn duration |
| mean_wer | Whisper accuracy on your STT manifest |
| mean_rtf | TTS synth-time / audio-time |
| vad_trimmed_ms | silence removed by Silero VAD |
| judge mean (C/R/Cn) | LLM-as-judge score on correctness / relevance / conciseness (1–5) |
| within_1 agreement | two-judge inter-rater agreement (cross-check for bias) |
Results land in eval/results/*.json — gitignored, machine-specific. Diff across runs to prove deltas.
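mean_wer is standard word error rate: word-level edit distance divided by reference length. The harness may well use a library for this, but the definition fits in a few lines:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (subs + inserts + deletes) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic Levenshtein DP over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the lights", "turn off the light"))  # → 0.5 (2 subs / 4 words)
```

Note WER can exceed 1.0 when the hypothesis has many insertions, which is why a per-manifest mean is reported rather than a single ratio.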
Keyword-hit scoring ("does the word 'Paris' appear in the answer?") is cheap but shallow. For a credible quality signal we score each reply with a stronger LLM against a published rubric: correctness, relevance, conciseness (each 1–5). An optional second judge runs in parallel so we can report inter-rater agreement — if two independent models don't agree, the metric itself is noise.
# Local, free, no API key — uses qwen2.5:7b via Ollama
python -m eval.runners.eval_llm --judge ollama --save
# Cross-check with a Groq-hosted judge from a different model family
python -m eval.runners.eval_llm \
--judge ollama:qwen2.5:7b \
--judge2 groq:llama-3.3-70b-versatile \
--save

Judge failures (bad JSON, timeout) are recorded per-item and don't fail the run. Rubric + known biases: eval/datasets/llm/rubric.md. Design: docs/design/phase-g1-llm-judge.md.
Design rationale for the harness itself: ADR 0001.
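The within_1 agreement number itself is a one-liner: the fraction of items where the two judges land within one point of each other. A sketch with made-up scores:

```python
def within_1_agreement(scores_a: list[int], scores_b: list[int]) -> float:
    """Fraction of items where two judges' scores differ by at most 1."""
    pairs = list(zip(scores_a, scores_b))
    return sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

judge1 = [5, 4, 3, 2, 5]  # e.g. correctness from the Ollama judge
judge2 = [4, 4, 5, 2, 5]  # e.g. from the Groq-hosted judge
print(within_1_agreement(judge1, judge2))  # → 0.8
```

A low value here is a signal about the metric, not the model: if independent judges disagree, their scores should not drive decisions.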
.
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI app + middleware chain
│ │ ├── config.py # All env-driven knobs
│ │ ├── core/
│ │ │ ├── auth.py # APIKeyMiddleware + WS gate
│ │ │ ├── rate_limit.py # Token-bucket per-key/IP
│ │ │ ├── logging.py # JSON / text formatter
│ │ │ └── timing.py # Per-stage X-Stage-*-Ms headers
│ │ ├── routers/
│ │ │ └── pipeline.py # /api/pipeline + /ws/voice
│ │ ├── services/
│ │ │ ├── stt_service.py # Faster-Whisper wrapper
│ │ │ ├── llm_service.py # Thin router to llm/ providers
│ │ │ ├── llm/ # Ollama | Groq providers
│ │ │ ├── tts_service.py # Thin router to tts/ providers
│ │ │ └── tts/ # Kokoro | Piper | OpenVoice
│ │ ├── streaming/
│ │ │ ├── async_stream.py # Sync→async generator bridge
│ │ │ ├── sentence_splitter.py
│ │ │ ├── turn_manager.py # Single in-flight turn + cancel
│ │ │ ├── vad.py # Silero frame-VAD
│ │ │ └── wav.py # PCM16 → WAV writer
│ │ └── audio/resample.py # Hygiene utilities
│ ├── tests/ # pytest unit tests
│ ├── requirements.txt
│ ├── pyproject.toml # pytest + ruff config
│ └── Dockerfile[.arm64]
├── frontend/
│ ├── public/recorder-worklet.js
│ ├── src/
│ │ ├── audio/ # streamingPlayer, microphoneStream, clientVad
│ │ ├── services/ # voiceWsClient, consent
│ │ └── components/ # VoiceAssistant, ConsentGate
│ └── package.json
├── eval/
│ ├── lib/{metrics,reporter}.py
│ ├── runners/eval_{llm,stt,tts,latency,streaming,baseline}.py
│ └── datasets/ # Golden Q&A + STT/TTS prompts
├── docs/
│ ├── adr/ # Architecture Decision Records
│ ├── design/ # Phase-by-phase notes (B.1–F)
│ └── design-doc-template.md
├── docker-compose.yml # Ollama + backend + frontend
└── .github/workflows/ci.yml # Lint + test matrix
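As an aside on app/core/rate_limit.py: the "token-bucket per-key/IP" it names has a well-known shape, sketched here as a stand-alone illustration (injected clock for determinism; not the project's code):

```python
import time

class TokenBucket:
    """Allow up to `rate_per_minute` requests, refilled continuously."""

    def __init__(self, rate_per_minute: int, clock=time.monotonic):
        self.capacity = rate_per_minute
        self.tokens = float(rate_per_minute)
        self.refill_per_sec = rate_per_minute / 60.0
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Deterministic demo with a fake clock:
t = [0.0]
bucket = TokenBucket(rate_per_minute=2, clock=lambda: t[0])
print([bucket.allow() for _ in range(3)])  # → [True, True, False]
t[0] = 30.0                                # 30 s later: one token refilled
print(bucket.allow())                      # → True
```

Because the state lives in one process's memory, a second replica gets its own bucket, which is exactly the Redis caveat in the cons table.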
- POST /api/pipeline — one-shot: audio in, {transcript, response, audio_b64} out. Good for eval and simple clients.
- POST /api/stt/transcribe, POST /api/chat/, POST /api/tts/synthesize — individual stages.
- GET /health, GET /ready — liveness / readiness.
Query params on WS /ws/voice:
- ?stream=1 — stream LLM tokens + sentence-level TTS (B.2 + B.3).
- ?continuous=1 — accept audio_frame messages; server VAD detects end-of-turn (B.4).
Full protocol, including barge_in, cancelled, and llm_delta, documented in docs/design/phase-b-streaming-pipeline.md.
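For a feel of the client side, a hedged sketch of a frame dispatcher. The type names llm_delta, barge_in, and cancelled come from the protocol doc; the payload field names (text, etc.) are assumptions here, not the documented schema:

```python
import json

def handle_frame(raw: str, state: dict) -> None:
    """Dispatch one JSON frame from /ws/voice (payload fields assumed)."""
    frame = json.loads(raw)
    kind = frame.get("type")
    if kind == "llm_delta":
        # Streamed LLM tokens accumulate into the visible reply.
        state["reply"] = state.get("reply", "") + frame.get("text", "")
    elif kind == "barge_in":
        state["speaking"] = False  # user interrupted; stop playback
    elif kind == "cancelled":
        state.clear()              # turn cancelled server-side; reset

state: dict = {}
handle_frame('{"type": "llm_delta", "text": "Hello"}', state)
handle_frame('{"type": "llm_delta", "text": ", world"}', state)
print(state["reply"])  # → Hello, world
```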
See CHANGELOG.md for what landed in each phase and docs/design/ for the design docs.
timeline
title Shipping history
Phase 0 : Eval harness + baseline metrics
Phase A : Ollama local LLM provider
Phase B.1 : Whisper VAD endpointing
Phase B.2 : Sentence-level TTS streaming
Phase B.3 : LLM token streaming
Phase B.4 : Continuous mic (plumbing)
Phase B.5 : Barge-in + cancellation
Phase 0.5 : Audio hygiene utilities
Phase C : Multilingual TTS (Piper)
Phase D' : Voice cloning scaffold
Phase E : ARM64 edge scaffold
Phase F : Hardening essentials
Next up (priority order):
- Validate B.2/B.3 in a real browser; capture baseline numbers.
- Wire B.4 continuous mic into the UI behind a hands-free toggle.
- Real-speech fixtures for STT eval (currently TTS-roundtrip only).
- Replace in-memory rate limiter with Redis when a second replica is needed.
- OpenTelemetry migration for the timing middleware (ADR 0001 follow-up).
Opt-in OpenTelemetry tracing (Phase G.2). Every voice turn becomes a waterfall of pipeline.stt → pipeline.llm_stream → pipeline.tts spans with attributes (language, audio bytes, token count, VAD trim). Exports over OTLP/HTTP, so any compliant backend works — Jaeger, Tempo, Grafana Cloud, Honeycomb, Datadog.
Zero-account local stack:
docker compose -f docker-compose.yml -f docker-compose.observability.yml up
# → Jaeger UI at http://localhost:16686
# → pick service "voice-assistant-backend"

Tracing is opt-in: OTEL_ENABLED defaults to false, so set OTEL_ENABLED=true to turn it on. The X-Stage-*-Ms response headers the eval runner depends on stay on regardless. Design + setup: docs/design/phase-g2-observability.md.
LoRA fine-tune Qwen2.5-3B-Instruct on your own chat data and serve the result through the existing Ollama provider with zero backend changes (Phase G.3). Paint-by-numbers Colab notebook + reproducible CLI scripts + Ollama Modelfile + A/B eval runner.
# 1. Train on Colab (free T4) using finetune/train.ipynb
# 2. Back on your machine, register the merged model:
cd finetune/out && ollama create voice-assistant-ft -f Modelfile
# 3. Flip backend/.env to use it:
# OLLAMA_MODEL=voice-assistant-ft
# 4. Prove the fine-tune helped (or didn't) with judge scores:
python -m eval.runners.eval_llm_compare \
--base qwen2.5:3b --finetuned voice-assistant-ft --judge ollama --save

Full guide with dataset curation tips + honest caveats: finetune/README.md. Design rationale: docs/design/phase-g3-lora-finetune.md.
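Step 4's comparison boils down to pairwise judge-score deltas on the same prompts. A sketch with made-up scores (positive mean = the fine-tune won; not the eval_llm_compare implementation):

```python
from statistics import mean

def ab_delta(base_scores: list[int], ft_scores: list[int]):
    """Per-item deltas plus mean improvement over identical prompts."""
    deltas = [ft - b for b, ft in zip(base_scores, ft_scores)]
    return deltas, mean(deltas)

base = [3, 4, 2, 4, 3]  # e.g. judge correctness for qwen2.5:3b
ft   = [4, 4, 3, 4, 4]  # e.g. for voice-assistant-ft
deltas, avg = ab_delta(base, ft)
print(deltas)  # → [1, 0, 1, 0, 1]
print(avg)     # → 0.6
```

Pairing on the same prompts matters: a raw mean over different prompt sets would confound model quality with prompt difficulty.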
| Doc | When to read it |
|---|---|
| README.md | This page — overview + quickstart |
| ARCHITECTURE.md | Deep dive, dataflow, failure modes, tradeoffs |
| CHANGELOG.md | What shipped in each phase |
| CONTRIBUTING.md | Dev setup, style, PR process |
| SECURITY.md | Threat model + disclosure policy |
| docs/adr/ | Immutable architectural decisions |
| docs/design/ | Per-phase ship notes (B.1 → F) |
| docs/design-doc-template.md | Template for your next phase |
| eval/README.md | How to run + read benchmarks |
- API key gate, rate limiter, and CORS allowlist are opt-in (set API_KEY, ALLOWED_ORIGINS).
- Mic audio never leaves the box in local mode (LLM_PROVIDER=ollama). Verify with the Network tab.
- No audio retention: the server holds the current turn's PCM in memory and drops it.
- The voice cloning path is gated by a consent modal; see phase-d-notes.
- Report vulnerabilities via the process in SECURITY.md. Do not file public issues for security bugs.
PRs welcome. Rules:
- Open an issue first for anything non-trivial.
- New phases or protocol changes require a design doc in docs/design/ before code.
- Every change that claims a perf or quality delta must include eval-harness numbers in the PR description.
- Tests pass, lint clean: pytest backend/tests + npx eslint frontend/src.
- Conventional commits (feat:, fix:, docs:, refactor:, test:).
Full guide: CONTRIBUTING.md.
MIT — see LICENSE. Third-party components retain their own licenses; see the credits block in ARCHITECTURE.md.
Built on top of: Ollama, faster-whisper / OpenAI Whisper, Kokoro, Piper, Silero VAD, Qwen, FastAPI, Vite, React. Thanks to every maintainer above.