Blazing‑fast local embeddings and true cross‑encoder reranking on Apple Silicon. Works with Native, OpenAI, TEI, and Cohere APIs.
This page is a beginner‑friendly quick start. Detailed guides live in docs/.
| API | Endpoint | Use Case |
|---|---|---|
| Native | `/api/v1/embed`, `/api/v1/rerank` | New projects |
| OpenAI | `/v1/embeddings`, `/v1/openai/rerank` (alias: `/v1/rerank_openai`) | Existing OpenAI code |
| TEI | `/embed`, `/rerank`, `/info` | Hugging Face TEI replacement |
| Cohere | `/v1/rerank`, `/v2/rerank` | Cohere API replacement |
| Docs | `/docs`, `/health` | More info |
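The request body differs per API family. As a quick sketch of the minimal payload shapes (the Native fields match the curl examples later in this README; the TEI and Cohere shapes follow those upstream APIs, which this server is assumed to mirror):

```python
# Native embed payload (same fields as the curl examples below)
native_embed = {"texts": ["hello", "world"], "normalize": True}

# TEI-compatible embed payload (TEI uses an "inputs" field)
tei_embed = {"inputs": ["hello", "world"]}

# Cohere-compatible rerank payload (query / documents / top_n)
cohere_rerank = {
    "query": "capital of france",
    "documents": ["Paris is the capital of France", "Berlin is in Germany"],
    "top_n": 2,
}
```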
Single Text Embedding Latency (milliseconds)

```text
Apple MLX     ████ 0.2ms
CUDA (Est.)   ████████████ 12ms
Vulkan (Est.) ████████████████████████ 25ms
PyTorch MPS   ████████████████████████████████████████████████ 45ms
PyTorch CPU   ████████████████████████████████████████████████████████████████████████████████████████████████████████ 120ms
              0ms        25ms        50ms        75ms        100ms       125ms
```

Maximum Throughput (texts per second)

```text
Apple MLX     ████████████████████████████████████████████████████████████████████████████████████████████████████████ 35,000
CUDA (Est.)   ████████████████████████████████ 8,000
Vulkan (Est.) ████████████ 3,000
PyTorch MPS   ██████ 1,500
PyTorch CPU   ██ 500
              0           10k          20k          30k          40k
```
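The latency chart implies the relative speedups directly; as a quick sanity check on the numbers:

```python
# Single-text latency from the chart above, in milliseconds
latency_ms = {
    "Apple MLX": 0.2,
    "CUDA (Est.)": 12,
    "Vulkan (Est.)": 25,
    "PyTorch MPS": 45,
    "PyTorch CPU": 120,
}

# How many times slower each backend is than MLX
slowdown = {name: ms / latency_ms["Apple MLX"] for name, ms in latency_ms.items()}
for name, factor in slowdown.items():
    print(f"{name}: {factor:.0f}x")  # MPS is 225x slower, CPU 600x
```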
- Install and run (embeddings only)

```bash
pip install embed-rerank

# Minimal .env
cat > .env <<'ENV'
BACKEND=auto
MODEL_NAME=mlx-community/Qwen3-Embedding-4B-4bit-DWQ
PORT=9000
HOST=0.0.0.0
ENV

embed-rerank  # http://localhost:9000
```

Want 2560‑D vectors by default? Add this to .env and restart:

```bash
cat >> .env <<'ENV'
# Use the model hidden_size (2560 for Qwen3-Embedding-4B) as output dimension
DIMENSION_STRATEGY=hidden_size
# Or enforce a fixed size (pads/truncates as needed):
# OUTPUT_EMBEDDING_DIMENSION=2560
# DIMENSION_STRATEGY=pad_or_truncate
ENV
```
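The `pad_or_truncate` strategy is simple to picture: vectors longer than the target dimension are cut, shorter ones are zero‑padded. A minimal sketch of the idea (not the server's actual implementation):

```python
def pad_or_truncate(vec: list[float], target_dim: int) -> list[float]:
    """Force a vector to target_dim: truncate if too long, zero-pad if too short."""
    if len(vec) >= target_dim:
        return vec[:target_dim]
    return vec + [0.0] * (target_dim - len(vec))

print(len(pad_or_truncate([0.1] * 2560, 1024)))  # truncated to 1024
print(len(pad_or_truncate([0.1] * 768, 2560)))   # zero-padded to 2560
```

Note that truncation changes a vector's norm, so keeping `normalize` enabled is generally sensible when forcing a fixed dimension.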
```bash
# Verify
curl -s http://localhost:9000/api/v1/embed/ \
  -H 'Content-Type: application/json' \
  -d '{"texts":["hello"],"normalize":true}' | jq '.vectors[0] | length'
```

- Try it (embeddings + simple rerank)
```bash
# Embeddings (Native)
curl -s http://localhost:9000/api/v1/embed/ \
  -H 'Content-Type: application/json' \
  -d '{"texts":["Hello MLX","Apple Silicon rocks"]}' | jq '.embeddings | length'
```
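Without a dedicated reranker configured, rerank requests still work via the fallback shown next. A reasonable mental model — an assumption about the implementation, not a description of it — is scoring each document by cosine similarity between query and document embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fallback_rerank(query_vec, doc_vecs, top_n):
    """Return (doc_index, score) pairs, best first."""
    scored = sorted(enumerate(cosine(query_vec, d) for d in doc_vecs),
                    key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy 2-D vectors standing in for real embeddings
print(fallback_rerank([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]], top_n=1))
```

A true cross-encoder reranker (configured in the next step) reads query and document together and generally ranks better than any embedding-similarity fallback.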
```bash
# Rerank fallback (no dedicated reranker yet)
curl -s http://localhost:9000/api/v1/rerank/ \
  -H 'Content-Type: application/json' \
  -d '{"query":"capital of france","documents":["Paris is the capital of France","Berlin is in Germany"],"top_n":2}' | jq '.results[0]'
```

- Add a dedicated reranker (better quality)
```bash
cat >> .env <<'ENV'
RERANKER_BACKEND=auto
RERANKER_MODEL_ID=cross-encoder/ms-marco-MiniLM-L-6-v2  # Torch (stable)
# MLX experimental v1 also available: vserifsaglam/Qwen3-Reranker-4B-4bit-MLX
ENV

# Restart server, then call Native or OpenAI-compatible rerank
curl -s http://localhost:9000/api/v1/rerank/ \
  -H 'Content-Type: application/json' \
  -d '{"query":"capital of france","documents":["Paris is the capital of France","Berlin is in Germany"],"top_n":2}' | jq '.results[0]'
```

- (Optional) Run as a macOS service
```bash
# Uses your .env to generate a LaunchAgent and start the service
./tools/setup-macos-service.sh

# Check status and health
launchctl list | grep com.embed-rerank.server
open http://localhost:9000/health/
```

Notes
- OpenAI drop-in supported for both embeddings and rerank (`/v1/embeddings`, `/v1/rerank`). See docs for a tiny SDK example.
- Scores may be auto‑sigmoid‑normalized for OpenAI clients by default (disable via `OPENAI_RERANK_AUTO_SIGMOID=false`).
- The root endpoint `/` shows both `embedding_dimension` (served) and `hidden_size` (model config) for clarity.
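The auto-sigmoid option maps raw cross-encoder scores, which can be any real number, into (0, 1), the range many OpenAI-style clients expect. The transform itself, for reference:

```python
import math

def sigmoid(score: float) -> float:
    """Map a raw relevance logit to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-score))

# Raw logits -> normalized scores; the ranking order is preserved
for raw in (-4.2, 0.0, 3.7):
    print(f"{raw:+.1f} -> {sigmoid(raw):.3f}")
```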
Run the full validation suite:

```bash
./tools/server-tests.sh --full
```

- Deployment profiles (Embeddings‑only, Fallback rerank, Dedicated reranker): docs/DEPLOYMENT_PROFILES.md
- OpenAI usage (tiny example + options): docs/ENHANCED_OPENAI_API.md
- Quality benchmarks (JSONL/CSV judgments): docs/QUALITY_BENCHMARKS.md
- Troubleshooting: docs/TROUBLESHOOTING.md
- Backend specs and performance: docs/BACKEND_TECHNICAL_SPECS.md, docs/PERFORMANCE_COMPARISON_CHARTS.md
Tiny OpenAI SDK example:

```python
import openai
import requests

client = openai.OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

# Embeddings via the OpenAI SDK
res = client.embeddings.create(model="text-embedding-ada-002", input=["hello world"])
print(len(res.data[0].embedding))

# Rerank (OpenAI-compatible): the OpenAI SDK has no rerank method,
# so POST to the endpoint directly
rr = requests.post(
    "http://localhost:9000/v1/openai/rerank",
    json={
        "query": "capital of france",
        "documents": [
            {"id": "a", "text": "Paris is the capital of France"},
            {"id": "b", "text": "Berlin is in Germany"},
        ],
        "top_n": 2,
    },
    timeout=30,
).json()
print(rr.get("results", rr))
```

| Status | Framework | Tests |
|---|---|---|
| ✅ | Open WebUI | Embed |
| ✅ | LightRAG | Embed, Rerank |
| ✅ | continue.dev | Embed, Rerank |
| ✅ | Kilo Code | Embed |
MIT License – build amazing things locally.