A FastAPI-based server that provides OpenAI-compatible API endpoints for running Large Language Models (LLMs) using MLX on Apple Silicon. This server wraps the MLX-LM library to provide a familiar OpenAI API interface.
- 🚀 OpenAI API Compatible: Drop-in replacement for the OpenAI API `/v1/completions` endpoint
- 🍎 Optimized for Apple Silicon: Leverages MLX for efficient inference on M1/M2/M3 chips
- 📡 Streaming Support: Real-time token streaming for responsive applications
- 🔄 Dynamic Model Loading: Load any MLX-compatible model from Hugging Face
- 🌐 CORS Enabled: Ready for web application integration
- ⚡ Fast API: Built on FastAPI for high performance and automatic API documentation
- macOS with Apple Silicon (M1/M2/M3)
- Python 3.8 or higher
- Git
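
The requirements above can be checked before installing anything. The following is a minimal sketch using only the Python standard library; the helper names `is_apple_silicon` and `python_is_supported` are illustrative and not part of this project. Note that a Python interpreter running under Rosetta reports `x86_64` even on an M-series Mac, so a negative result does not always mean the hardware is unsupported.

```python
import platform
import sys


def is_apple_silicon(system: str, machine: str) -> bool:
    """Return True for macOS running natively on an ARM64 (M-series) CPU."""
    return system == "Darwin" and machine == "arm64"


def python_is_supported(version: tuple, minimum: tuple = (3, 8)) -> bool:
    """Return True if the interpreter meets the documented minimum (3.8)."""
    return tuple(version[:2]) >= minimum


# Report the state of the current machine and interpreter.
print("Apple Silicon:", is_apple_silicon(platform.system(), platform.machine()))
print("Python OK:", python_is_supported(sys.version_info))
```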
# Clone via SSH (after setting up SSH keys)
git clone git@github.com:yourusername/mlx_server.git
cd mlx_server
# Or clone via HTTPS
git clone https://github.com/yourusername/mlx_server.git
cd mlx_server

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On macOS/Linux
# Install dependencies
pip install -r requirements.txt

Run the server with the default model (Llama-3.2-3B-Instruct-4bit):
uvicorn main:app --host 0.0.0.0 --port 8000

You can specify any MLX-compatible model from Hugging Face:
# Using command line argument
python main.py --model mlx-community/Mistral-7B-Instruct-v0.3-4bit
# Or using environment variable
MODEL_ID=mlx-community/Mistral-7B-Instruct-v0.3-4bit uvicorn main:app

python main.py --help
Options:
--model MODEL_ID Model to load (default: mlx-community/Llama-3.2-3B-Instruct-4bit)
--host HOST Host to bind to (default: 0.0.0.0)
--port PORT Port to bind to (default: 8000)
--reload Enable auto-reload for development

GET /v1/models

Example:
curl http://localhost:8000/v1/models

POST /v1/completions

Example (non-streaming):
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"prompt": "Once upon a time",
"max_tokens": 100,
"temperature": 0.7
}'

Example (streaming):
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"prompt": "Write a poem about AI",
"max_tokens": 200,
"temperature": 0.9,
"stream": true
}'

GET /health

Example:
curl http://localhost:8000/health

This server is compatible with the OpenAI Python SDK. Simply point the SDK to your local server:
from openai import OpenAI
# Point to local server
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed" # API key not required for local server
)
# Use as normal
response = client.completions.create(
model="mlx-community/Llama-3.2-3B-Instruct-4bit",
prompt="Hello, how are you?",
max_tokens=50
)
print(response.choices[0].text)

Any model available in the MLX Community on Hugging Face is supported. Popular options include:
- `mlx-community/Llama-3.2-3B-Instruct-4bit` (default)
- `mlx-community/Mistral-7B-Instruct-v0.3-4bit`
- `mlx-community/Phi-3.5-mini-instruct-4bit`
- `mlx-community/Qwen2.5-7B-Instruct-4bit`
- `mlx-community/gemma-2-2b-it-4bit`

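
With any of these models, the streaming request shown earlier returns its output incrementally. The sketch below shows one way a client might reassemble the text, assuming the server follows OpenAI's server-sent events framing (`data: {json}` lines, terminated by `data: [DONE]`). This README does not spell out the wire format, so treat that framing as an assumption; `iter_completion_text` is an illustrative helper, not part of this project.

```python
import json
from typing import Iterable, Iterator


def iter_completion_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        # Skip blank separator lines and anything that is not a data event.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # OpenAI's API marks the end of a stream with a literal [DONE].
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("text")
            if text:
                yield text


# Hand-written sample lines (assumed format), standing in for a live response.
sample = [
    'data: {"choices": [{"index": 0, "text": "Once"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": " upon"}]}',
    "data: [DONE]",
]
print("".join(iter_completion_text(sample)))
```

Against a running server, the same function could be fed the decoded lines of the HTTP response body instead of the hand-written sample.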
The /v1/completions endpoint supports the following parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | Model ID to use |
| `prompt` | string/array | required | The prompt(s) to generate completions for |
| `max_tokens` | integer | 100 | Maximum tokens to generate |
| `temperature` | float | 0.8 | Sampling temperature (0-2) |
| `top_p` | float | 0.95 | Nucleus sampling parameter |
| `n` | integer | 1 | Number of completions (currently supports 1) |
| `stream` | boolean | false | Enable streaming response |
| `stop` | string/array | null | Stop sequences |
| `presence_penalty` | float | 0.0 | Presence penalty (-2 to 2) |
| `frequency_penalty` | float | 0.0 | Frequency penalty (-2 to 2) |
| `echo` | boolean | false | Include prompt in response |

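
To make the defaults and ranges in the table concrete, here is a minimal client-side sketch that assembles a request body and rejects out-of-range values before anything is sent. The function `build_completion_request` is hypothetical and not part of this project; the server may validate differently, and the `top_p` range of 0 to 1 is an assumption since the table does not state it.

```python
from typing import Any, Dict, List, Optional, Union


def build_completion_request(
    model: str,
    prompt: Union[str, List[str]],
    max_tokens: int = 100,
    temperature: float = 0.8,
    top_p: float = 0.95,
    stream: bool = False,
    stop: Optional[Union[str, List[str]]] = None,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    echo: bool = False,
) -> Dict[str, Any]:
    """Build a /v1/completions body using the documented defaults."""
    if not model:
        raise ValueError("model is required")
    if not prompt:
        raise ValueError("prompt is required")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if not 0 <= top_p <= 1:  # assumed range; not stated in the table
        raise ValueError("top_p must be between 0 and 1")
    for name, value in (
        ("presence_penalty", presence_penalty),
        ("frequency_penalty", frequency_penalty),
    ):
        if not -2 <= value <= 2:
            raise ValueError(f"{name} must be between -2 and 2")

    payload: Dict[str, Any] = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "n": 1,  # the table notes that only 1 is currently supported
        "stream": stream,
        "presence_penalty": presence_penalty,
        "frequency_penalty": frequency_penalty,
        "echo": echo,
    }
    # Leave "stop" out entirely when no stop sequences are requested.
    if stop is not None:
        payload["stop"] = stop
    return payload


request_body = build_completion_request(
    "mlx-community/Llama-3.2-3B-Instruct-4bit", "Once upon a time"
)
print(request_body)
```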
FastAPI provides automatic interactive API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
For development, use the --reload flag:
python main.py --reload

# Test root endpoint
curl http://localhost:8000/
# Test model listing
curl http://localhost:8000/v1/models
# Test completion
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "mlx-community/Llama-3.2-3B-Instruct-4bit", "prompt": "Hello", "max_tokens": 20}'

Create a Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run:
docker build -t mlx-server .
docker run -p 8000:8000 -e MODEL_ID=mlx-community/Llama-3.2-3B-Instruct-4bit mlx-server

- Ensure you have sufficient RAM for the model you're trying to load
- Check your internet connection for downloading models from Hugging Face
- Verify the model ID is correct and exists in the MLX Community
- For large models, consider using quantized versions (4-bit or 8-bit)
- Close other applications to free up memory
- Use the `--max-kv-size` parameter in the model loading for long contexts
- Check the server logs for detailed error messages
- Verify the request format matches the OpenAI API specification
- Ensure all required parameters are provided
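
For the memory-related items above, a rough lower bound on the RAM a model needs can be computed from its parameter count and quantization width. This is a back-of-the-envelope sketch, not a measurement: it counts weights only, ignores the KV cache, activations, and the per-group scale data that quantized formats carry, and takes the nominal parameter counts (3 billion, 7 billion) from the model names. `estimate_weight_memory_gb` is an illustrative helper.

```python
def estimate_weight_memory_gb(num_parameters: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    total_bytes = num_parameters * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal sizes taken from the model names; real checkpoints differ somewhat.
for label, params, bits in [
    ("3B at 4-bit", 3e9, 4),
    ("7B at 4-bit", 7e9, 4),
    ("7B at 16-bit", 7e9, 16),
]:
    print(f"{label}: about {estimate_weight_memory_gb(params, bits):.1f} GB of weights")
```

Actual usage will be higher than these figures, so leave headroom, particularly for long contexts.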
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- MLX - Machine learning framework for Apple Silicon
- MLX-LM - LLM package for MLX
- FastAPI - Modern web framework for building APIs
- Hugging Face - Model repository and community
For issues and questions:
- Check the Issues page
- Create a new issue with detailed information about your problem
- Join the MLX community discussions