MLX-LM OpenAI Compatible API Server

A FastAPI-based server that exposes OpenAI-compatible API endpoints for running Large Language Models (LLMs) with MLX on Apple Silicon. It wraps the MLX-LM library behind the familiar OpenAI API interface.

Features

🚀 OpenAI API Compatible: Drop-in replacement for the OpenAI API /v1/completions endpoint
🍎 Optimized for Apple Silicon: Leverages MLX for efficient inference on M1/M2/M3 chips
📡 Streaming Support: Real-time token streaming for responsive applications
🔄 Dynamic Model Loading: Load any MLX-compatible model from Hugging Face
🌐 CORS Enabled: Ready for web application integration
⚡ Fast API: Built on FastAPI for high performance and automatic API documentation

Prerequisites

macOS with Apple Silicon (M1/M2/M3)
Python 3.8 or higher
Git

Installation

1. Clone the Repository

# Clone via SSH (after setting up SSH keys)
git clone git@github.com:yourusername/mlx_server.git
cd mlx_server

# Or clone via HTTPS
git clone https://github.com/yourusername/mlx_server.git
cd mlx_server

2. Install Dependencies

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On macOS/Linux

# Install dependencies
pip install -r requirements.txt

Usage

Basic Usage

Run the server with the default model (Llama-3.2-3B-Instruct-4bit):

uvicorn main:app --host 0.0.0.0 --port 8000

Specify a Different Model

You can specify any MLX-compatible model from Hugging Face:

# Using command line argument
python main.py --model mlx-community/Mistral-7B-Instruct-v0.3-4bit

# Or using environment variable
MODEL_ID=mlx-community/Mistral-7B-Instruct-v0.3-4bit uvicorn main:app

Available Command Line Options

python main.py --help

Options:
  --model MODEL_ID    Model to load (default: mlx-community/Llama-3.2-3B-Instruct-4bit)
  --host HOST         Host to bind to (default: 0.0.0.0)
  --port PORT         Port to bind to (default: 8000)
  --reload            Enable auto-reload for development
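
As a rough illustration of how these options and the MODEL_ID environment variable could fit together, here is a minimal argparse sketch. It is not taken from main.py: the helper name build_parser and the precedence (command-line flag, then MODEL_ID, then the documented default) are assumptions.

```python
import argparse
import os

DEFAULT_MODEL = "mlx-community/Llama-3.2-3B-Instruct-4bit"


def build_parser(env=None):
    """Build a parser mirroring the documented options (hypothetical helper)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="MLX-LM OpenAI-compatible server")
    # Assumption: --model falls back to MODEL_ID, then to the documented default.
    parser.add_argument("--model", default=env.get("MODEL_ID", DEFAULT_MODEL),
                        help="Model to load")
    parser.add_argument("--host", default="0.0.0.0", help="Host to bind to")
    parser.add_argument("--port", type=int, default=8000, help="Port to bind to")
    parser.add_argument("--reload", action="store_true",
                        help="Enable auto-reload for development")
    return parser
```

Calling build_parser().parse_args() with no arguments would then yield the defaults listed above.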

API Endpoints

1. List Models

GET /v1/models

Example:

curl http://localhost:8000/v1/models

2. Create Completion

POST /v1/completions

Example (non-streaming):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "prompt": "Once upon a time",
    "max_tokens": 100,
    "temperature": 0.7
  }'

Example (streaming):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "prompt": "Write a poem about AI",
    "max_tokens": 200,
    "temperature": 0.9,
    "stream": true
  }'
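
With "stream": true, OpenAI-compatible servers typically reply with server-sent events: one "data: {json}" line per chunk, ending with "data: [DONE]". Assuming this server follows that convention (the exact chunk format is not shown in this README), the sketch below extracts the generated text from such lines. The function name iter_completion_text and the sample lines are illustrative only.

```python
import json


def iter_completion_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        # Skip blank separator lines and anything that is not a data event.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # Sentinel that ends the stream.
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("text")
            if text:
                yield text


# Hypothetical sample of what the stream might look like.
sample = [
    b'data: {"choices": [{"index": 0, "text": "Roses"}]}\n',
    b"\n",
    b'data: {"choices": [{"index": 0, "text": " are red"}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_completion_text(sample)))
```

In practice the lines would come from the HTTP response itself, for example by iterating over the object returned by urllib.request.urlopen.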

3. Health Check

GET /health

Example:

curl http://localhost:8000/health
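
Because loading a model can take a while, a client may want to wait for /health before sending requests. The sketch below polls until the endpoint answers with HTTP 200. It assumes only the status code matters, since the response body is not documented here; wait_until_healthy and its probe parameter are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(url, attempts=30, delay=1.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe(url) succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A typical call would be wait_until_healthy("http://localhost:8000/health"); the probe and sleep arguments exist mainly so the loop can be exercised without a running server.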

OpenAI SDK Compatibility

This server is compatible with the OpenAI Python SDK. Simply point the SDK to your local server:

from openai import OpenAI

# Point to local server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # API key not required for local server
)

# Use as normal
response = client.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    prompt="Hello, how are you?",
    max_tokens=50
)

print(response.choices[0].text)

Supported Models

Any MLX-LM-compatible language model from the MLX Community on Hugging Face should work. Popular options include:

mlx-community/Llama-3.2-3B-Instruct-4bit (default)
mlx-community/Mistral-7B-Instruct-v0.3-4bit
mlx-community/Phi-3.5-mini-instruct-4bit
mlx-community/Qwen2.5-7B-Instruct-4bit
mlx-community/gemma-2-2b-it-4bit

API Parameters

The /v1/completions endpoint supports the following parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | Model ID to use |
| `prompt` | string/array | required | The prompt(s) to generate completions for |
| `max_tokens` | integer | 100 | Maximum tokens to generate |
| `temperature` | float | 0.8 | Sampling temperature (0-2) |
| `top_p` | float | 0.95 | Nucleus sampling parameter |
| `n` | integer | 1 | Number of completions (only 1 is currently supported) |
| `stream` | boolean | false | Enable streaming response |
| `stop` | string/array | null | Stop sequences |
| `presence_penalty` | float | 0.0 | Presence penalty (-2 to 2) |
| `frequency_penalty` | float | 0.0 | Frequency penalty (-2 to 2) |
| `echo` | boolean | false | Include prompt in response |
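
To make the defaults and ranges in the table concrete, here is a small client-side sketch that fills in the defaults and rejects out-of-range values before a request is sent. build_completion_request is a hypothetical helper, and the server's own validation may differ.

```python
# Defaults copied from the parameter table.
DEFAULTS = {
    "max_tokens": 100,
    "temperature": 0.8,
    "top_p": 0.95,
    "n": 1,
    "stream": False,
    "stop": None,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "echo": False,
}

# Inclusive ranges taken from the parameter table.
RANGES = {
    "temperature": (0.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
}


def build_completion_request(model, prompt, **overrides):
    """Return a request body with defaults applied (illustrative helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    body = {"model": model, "prompt": prompt, **DEFAULTS, **overrides}
    for name, (low, high) in RANGES.items():
        if not low <= body[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    if body["n"] != 1:
        raise ValueError("only n=1 is currently supported")
    return body
```

The resulting dictionary can be serialized with json.dumps and sent as the body of a POST to /v1/completions.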

Interactive API Documentation

FastAPI provides automatic interactive API documentation:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Development

Running with Auto-reload

For development, use the --reload flag:

python main.py --reload

Testing the Server

# Test root endpoint
curl http://localhost:8000/

# Test model listing
curl http://localhost:8000/v1/models

# Test completion
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Llama-3.2-3B-Instruct-4bit", "prompt": "Hello", "max_tokens": 20}'

Docker Support (Optional)

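Containers on macOS run inside a Linux virtual machine without access to the Apple GPU, so the Metal acceleration that MLX relies on is unavailable inside the container. With that limitation in mind, create a Dockerfile: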

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py .

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run:

docker build -t mlx-server .
docker run -p 8000:8000 -e MODEL_ID=mlx-community/Llama-3.2-3B-Instruct-4bit mlx-server

Troubleshooting

Model Loading Issues

Ensure you have sufficient RAM for the model you're trying to load
Check your internet connection for downloading models from Hugging Face
Verify the model ID is correct and exists in the MLX Community

Performance Issues

For large models, consider using quantized versions (4-bit or 8-bit)
Close other applications to free up memory
For long contexts, limit the KV cache size (MLX-LM's generation code accepts a max_kv_size setting; it is not among the server command-line options listed above)
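
A back-of-the-envelope estimate shows why quantization helps: weight memory is roughly the parameter count times the bits per weight, divided by 8. The sketch below ignores quantization scales, the KV cache, and activations, so real usage will be higher; the function name is illustrative.

```python
def estimate_weight_gb(num_params, bits_per_weight):
    """Approximate weight memory in GB (10**9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: ~{estimate_weight_gb(7e9, bits):.1f} GB")
```

By this estimate a 7B model needs about 14 GB of weights at 16-bit but about 3.5 GB at 4-bit.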

API Errors

Check the server logs for detailed error messages
Verify the request format matches the OpenAI API specification
Ensure all required parameters are provided

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

MLX - Machine learning framework for Apple Silicon
MLX-LM - LLM package for MLX
FastAPI - Modern web framework for building APIs
Hugging Face - Model repository and community

Support

For issues and questions:

Check the Issues page
Create a new issue with detailed information about your problem
Join the MLX community discussions
