fatihbugrakdogan/mlx_openai_server

MLX-LM OpenAI Compatible API Server

A FastAPI-based server that provides OpenAI-compatible API endpoints for running Large Language Models (LLMs) using MLX on Apple Silicon. This server wraps the MLX-LM library to provide a familiar OpenAI API interface.

Features

  • 🚀 OpenAI API Compatible: Drop-in replacement for OpenAI API /v1/completions endpoint
  • 🍎 Optimized for Apple Silicon: Leverages MLX for efficient inference on M1/M2/M3 chips
  • 📡 Streaming Support: Real-time token streaming for responsive applications
  • 🔄 Dynamic Model Loading: Load any MLX-compatible model from Hugging Face
  • 🌐 CORS Enabled: Ready for web application integration
  • Fast API: Built on FastAPI for high performance and automatic API documentation

Prerequisites

  • macOS with Apple Silicon (M1/M2/M3)
  • Python 3.8 or higher
  • Git

Installation

1. Clone the Repository

# Clone via SSH (after setting up SSH keys)
git clone git@github.com:fatihbugrakdogan/mlx_openai_server.git
cd mlx_openai_server

# Or clone via HTTPS
git clone https://github.com/fatihbugrakdogan/mlx_openai_server.git
cd mlx_openai_server

2. Install Dependencies

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On macOS/Linux

# Install dependencies
pip install -r requirements.txt

Usage

Basic Usage

Run the server with the default model (Llama-3.2-3B-Instruct-4bit):

uvicorn main:app --host 0.0.0.0 --port 8000

Specify a Different Model

You can specify any MLX-compatible model from Hugging Face:

# Using command line argument
python main.py --model mlx-community/Mistral-7B-Instruct-v0.3-4bit

# Or using environment variable
MODEL_ID=mlx-community/Mistral-7B-Instruct-v0.3-4bit uvicorn main:app

Available Command Line Options

python main.py --help

Options:
  --model MODEL_ID    Model to load (default: mlx-community/Llama-3.2-3B-Instruct-4bit)
  --host HOST         Host to bind to (default: 0.0.0.0)
  --port PORT         Port to bind to (default: 8000)
  --reload            Enable auto-reload for development
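
The options above could be wired up roughly as follows. This is a minimal sketch, not the project's actual main.py: the helper name build_parser and the rule that MODEL_ID supplies the default while --model overrides it are assumptions made for illustration.

```python
import argparse
import os

DEFAULT_MODEL = "mlx-community/Llama-3.2-3B-Instruct-4bit"


def build_parser(env=None):
    """Build a parser mirroring the documented options (hypothetical helper)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="MLX-LM OpenAI-compatible server")
    # Assumption: MODEL_ID from the environment is the fallback default,
    # and an explicit --model on the command line takes precedence.
    parser.add_argument("--model", default=env.get("MODEL_ID", DEFAULT_MODEL),
                        help="Model to load")
    parser.add_argument("--host", default="0.0.0.0", help="Host to bind to")
    parser.add_argument("--port", type=int, default=8000, help="Port to bind to")
    parser.add_argument("--reload", action="store_true",
                        help="Enable auto-reload for development")
    return parser
```

Calling build_parser().parse_args() with no arguments would yield the defaults listed above; passing --port 9000 --reload changes only those two values.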

API Endpoints

1. List Models

GET /v1/models

Example:

curl http://localhost:8000/v1/models

2. Create Completion

POST /v1/completions

Example (non-streaming):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "prompt": "Once upon a time",
    "max_tokens": 100,
    "temperature": 0.7
  }'
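
The same request can be sent from Python using only the standard library. The sketch below assumes the server listens on localhost:8000 and returns an OpenAI-style body with the generated text at choices[0].text; the helper names build_completion_request and complete are illustrative and not part of the project.

```python
import json
import urllib.request


def build_completion_request(prompt,
                             model="mlx-community/Llama-3.2-3B-Instruct-4bit",
                             base_url="http://localhost:8000",
                             **params):
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {"model": model, "prompt": prompt, **params}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **params):
    """Send the request and return the generated text."""
    req = build_completion_request(prompt, **params)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumption: the response follows the OpenAI completions shape.
    return body["choices"][0]["text"]
```

With the server running, complete("Once upon a time", max_tokens=100, temperature=0.7) mirrors the curl call above.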

Example (streaming):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "prompt": "Write a poem about AI",
    "max_tokens": 200,
    "temperature": 0.9,
    "stream": true
  }'
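
When "stream" is true, a client has to reassemble the text from individual chunks. The README does not specify the wire format, so the sketch below assumes the OpenAI convention: server-sent event lines of the form "data: {json}" terminated by "data: [DONE]", with each chunk carrying its text at choices[0].text. The function name iter_stream_text is hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel, as in the OpenAI API
        chunk = json.loads(data)
        text = chunk["choices"][0].get("text", "")
        if text:
            yield text
```

Because an HTTP response object from urllib.request.urlopen can be iterated line by line, it could be passed directly as lines, printing each fragment as it arrives.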

3. Health Check

GET /health

Example:

curl http://localhost:8000/health
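
Since the model may take a while to download and load, a script might poll the health endpoint before sending requests. This is a sketch under the assumption that /health answers with HTTP 200 once the server is ready; the README does not describe the response body, so only the status code is checked. The name wait_until_healthy and its parameters are illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url="http://localhost:8000/health", timeout=60.0,
                       interval=1.0, probe=None, sleep=time.sleep):
    """Poll the health endpoint until it responds or the timeout expires."""

    def default_probe(target):
        try:
            with urllib.request.urlopen(target, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False  # not listening yet, or returned an error status

    probe = probe or default_probe
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The probe and sleep parameters exist only so the loop can be exercised without a running server.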

OpenAI SDK Compatibility

This server is compatible with the OpenAI Python SDK. Simply point the SDK to your local server:

from openai import OpenAI

# Point to local server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # API key not required for local server
)

# Use as normal
response = client.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    prompt="Hello, how are you?",
    max_tokens=50
)

print(response.choices[0].text)

Supported Models

Any model available in the MLX Community on Hugging Face is supported. Popular options include:

  • mlx-community/Llama-3.2-3B-Instruct-4bit (default)
  • mlx-community/Mistral-7B-Instruct-v0.3-4bit
  • mlx-community/Phi-3.5-mini-instruct-4bit
  • mlx-community/Qwen2.5-7B-Instruct-4bit
  • mlx-community/gemma-2-2b-it-4bit

API Parameters

The /v1/completions endpoint supports the following parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model ID to use |
| prompt | string/array | required | The prompt(s) to generate completions for |
| max_tokens | integer | 100 | Maximum tokens to generate |
| temperature | float | 0.8 | Sampling temperature (0-2) |
| top_p | float | 0.95 | Nucleus sampling parameter |
| n | integer | 1 | Number of completions (currently supports 1) |
| stream | boolean | false | Enable streaming response |
| stop | string/array | null | Stop sequences |
| presence_penalty | float | 0.0 | Presence penalty (-2 to 2) |
| frequency_penalty | float | 0.0 | Frequency penalty (-2 to 2) |
| echo | boolean | false | Include prompt in response |
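
A client can check a payload against the documented ranges before sending it. The sketch below is a client-side convenience only; how the server itself validates requests is not described here, and the function name validate_completion_params is hypothetical. It covers only the constraints stated in the table (required fields, numeric ranges, and n being limited to 1).

```python
def validate_completion_params(params):
    """Return a list of problems found in a /v1/completions payload."""
    problems = []
    for required in ("model", "prompt"):
        if required not in params:
            problems.append(f"missing required parameter: {required}")

    # Ranges taken from the parameter table above.
    ranges = {
        "temperature": (0.0, 2.0),
        "presence_penalty": (-2.0, 2.0),
        "frequency_penalty": (-2.0, 2.0),
    }
    for name, (low, high) in ranges.items():
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name} must be between {low} and {high}, got {value}")

    if params.get("n", 1) != 1:
        problems.append("n: only 1 completion is currently supported")
    return problems
```

An empty list means the payload is consistent with the table; otherwise each entry names one violated constraint.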

Interactive API Documentation

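FastAPI provides automatic interactive API documentation. Assuming the FastAPI default paths are unchanged, it is available at:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc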

Development

Running with Auto-reload

For development, use the --reload flag:

python main.py --reload

Testing the Server

# Test root endpoint
curl http://localhost:8000/

# Test model listing
curl http://localhost:8000/v1/models

# Test completion
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Llama-3.2-3B-Instruct-4bit", "prompt": "Hello", "max_tokens": 20}'

Docker Support (Optional)

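Create a Dockerfile. Note that Docker containers on macOS run inside a Linux VM without access to the Metal GPU, so inference in a container will not be Metal-accelerated: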

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py .

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run:

docker build -t mlx-server .
docker run -p 8000:8000 -e MODEL_ID=mlx-community/Llama-3.2-3B-Instruct-4bit mlx-server

Troubleshooting

Model Loading Issues

  • Ensure you have sufficient RAM for the model you're trying to load
  • Check your internet connection for downloading models from Hugging Face
  • Verify the model ID is correct and exists in the MLX Community

Performance Issues

  • For large models, consider using quantized versions (4-bit or 8-bit)
  • Close other applications to free up memory
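  • For long contexts, limit the KV cache size via MLX-LM's max_kv_size generation setting (exposed as --max-kv-size by the mlx_lm.generate CLI); this option is not among the server flags listed above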
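
To judge whether a model is likely to fit in memory, a rough lower bound on weight size is parameters × bits per weight ÷ 8. The sketch below applies that arithmetic; it deliberately ignores quantization scales, the KV cache, and activation memory, so real usage will be higher. The function name is illustrative, and the parameter counts are nominal values taken from the model names.

```python
def estimate_weight_memory_gb(num_params_billion, bits_per_weight):
    """Lower-bound estimate of weight memory in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal 3B model at 4-bit: about 1.5 GB of weights.
print(estimate_weight_memory_gb(3, 4))
# Nominal 7B model at 4-bit versus 16-bit: about 3.5 GB versus 14 GB.
print(estimate_weight_memory_gb(7, 4), estimate_weight_memory_gb(7, 16))
```

By this estimate, moving a 7B model from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is why quantized versions are the usual first step when memory is tight.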

API Errors

  • Check the server logs for detailed error messages
  • Verify the request format matches the OpenAI API specification
  • Ensure all required parameters are provided

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • MLX - Machine learning framework for Apple Silicon
  • MLX-LM - LLM package for MLX
  • FastAPI - Modern web framework for building APIs
  • Hugging Face - Model repository and community

Support

For issues and questions:

  1. Check the Issues page
  2. Create a new issue with detailed information about your problem
  3. Join the MLX community discussions
