vLLM MUSA

vLLM Hardware Plugin for Moore Threads MUSA

English | 中文



About

The vLLM Hardware Plugin for Moore Threads MUSA integrates Moore Threads (MUSA) GPUs with vLLM to enable high-performance large language model inference. It follows the principles of the [RFC]: Hardware Pluggable and [RFC]: Enhancing vLLM Plugin Architecture proposals, providing a modular interface for Moore Threads MUSA hardware.

The plugin leverages the following key components:

  • torchada: CUDA→MUSA compatibility layer for PyTorch — run CUDA code on MUSA with zero code changes
  • mthreads-ml-py: Moore Threads Management Library (MTML) Python bindings for device management and queries
  • MATE: MUSA AI Tensor Engine — high-performance computing library optimized for LLM inference on MUSA architecture
  • torch_musa: PyTorch backend for Moore Threads (MUSA) GPUs — extends PyTorch with native MUSA device support
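
As a quick illustration, you can probe which pieces of this stack are importable in the current environment. The module names below (torch_musa, torchada, mtml) are assumptions based on the package names above; the actual import names may differ, so check each project's documentation:

```python
# Sketch: report which MUSA-related packages are importable, without
# actually importing them (find_spec only inspects the import machinery).
import importlib.util


def musa_stack_status():
    """Return {module_name: importable?} for the MUSA stack components."""
    # Hypothetical import names; verify against each project's docs.
    packages = ["torch_musa", "torchada", "mtml"]
    return {name: importlib.util.find_spec(name) is not None for name in packages}


print(musa_stack_status())
```

On a machine without the MUSA stack installed, every value will simply be False.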

Requirements

  • Python: 3.9 or higher
  • Hardware: Moore Threads (MUSA) GPU with MUSA toolkit installed
  • Dependencies: torchada, torch_musa, MATE, and mthreads-ml-py (see About above)

Getting Started

Supported Versions

| vLLM Version | PyTorch Version | Engine  | Status       |
| ------------ | --------------- | ------- | ------------ |
| 0.17.0       | 2.7.1           | V1 only | ✅ Supported |

Note: This plugin uses vLLM's V1 engine architecture. The V0 engine is not supported.

Install from Source

  1. Clone the repository:

    git clone https://github.com/MooreThreads/vllm-musa.git
    cd vllm-musa
  2. Install vLLM Hardware Plugin for Moore Threads MUSA:

    # Standard installation (installs vLLM MUSA plugin and vLLM)
    pip install . --no-build-isolation -v
    
    # Or editable installation for development
    pip install -e . --no-build-isolation -v
  3. Verify the installation:

    # Check plugin registration
    python -c "from vllm_musa import musa_platform_plugin; print('Plugin loaded successfully')"
    
    # Check MTML device management
    python -c "from vllm_musa.musa import mtml_available; print(f'MTML available: {mtml_available}')"
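
You can also inspect which platform plugins vLLM will discover via Python entry points. The entry-point group name "vllm.platform_plugins" is an assumption here; confirm it against your vLLM version's plugin documentation:

```python
# Sketch: list installed vLLM platform plugins via package entry points.
import sys
from importlib.metadata import entry_points


def list_platform_plugins(group="vllm.platform_plugins"):
    """Return the names of entry points registered under the given group."""
    eps = entry_points()
    if sys.version_info >= (3, 10):
        selected = eps.select(group=group)
    else:
        # Python 3.9: entry_points() returns a dict-like mapping of groups.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


print(list_platform_plugins())
```

An empty list means no plugin registered under that group; after installing vllm-musa you would expect its platform plugin to appear here.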

Environment Variables

| Variable                           | Description                                                              |
| ---------------------------------- | ------------------------------------------------------------------------ |
| MUSA_VISIBLE_DEVICES               | Controls which MUSA devices are visible (analogous to CUDA_VISIBLE_DEVICES) |
| VLLM_WORKER_MULTIPROC_METHOD=spawn | Recommended start method for multi-process workers                       |
| VLLM_MUSA_CUSTOM_OP_USE_NATIVE     | Use vLLM's native implementations of custom ops (default: False)         |
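
Since the worker start method and device visibility are read at startup, set these variables before importing vLLM. A minimal sketch (the device list "0,1" is just an example):

```python
# Sketch: configure the recommended environment before importing vLLM.
import os

# setdefault leaves any value already set in the shell untouched.
os.environ.setdefault("MUSA_VISIBLE_DEVICES", "0,1")            # expose GPUs 0 and 1
os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")  # recommended

print(os.environ["VLLM_WORKER_MULTIPROC_METHOD"])
```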

Usage

Once installed, the plugin is automatically detected by vLLM. Simply run vLLM as usual:

from vllm import LLM, SamplingParams

# vLLM will automatically use the MUSA platform
llm = LLM(model="your-model-path", trust_remote_code=True)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
outputs = llm.generate(["Hello, how are you?"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)

OpenAI-Compatible Server

# Start the server
vllm serve /path/to/model/

# Test completions API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/path/to/model/", "prompt": "Hello!", "max_tokens": 50}'

# Test chat completions API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/path/to/model/", "messages": [{"role": "user", "content": "What is 2+2?"}], "max_tokens": 50}'
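
The same completions request can be issued from Python with only the standard library. The base URL and model path below are placeholders matching the curl examples above:

```python
# Sketch: build a POST request for the OpenAI-compatible /v1/completions
# endpoint using only the standard library.
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=50):
    """Build an HTTP request for the /v1/completions endpoint."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("http://localhost:8000", "/path/to/model/", "Hello!")
print(req.full_url)
# Send with urllib.request.urlopen(req) once the server is running.
```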

Testing

Run the test suite:

# Run all tests
make test

# Run specific test file
pytest tests/test_musa.py -v
pytest tests/test_patches.py -v

# Run with coverage
make test-cov

Project Structure

vllm-musa/
├── pyproject.toml              # Project configuration
├── README.md                   # Documentation (English)
├── README_CN.md                # Documentation (中文)
├── LICENSE                     # Apache 2.0 License
├── example/                    # Usage examples
├── csrc/                       # C/C++ source files
├── docs/                       # Additional documentation
├── vllm_musa/                  # Main package
│   ├── __init__.py             # Plugin entry point
│   ├── musa.py                 # MUSA platform implementation
│   └── patches/                # Runtime compatibility patches
│       ├── __init__.py         # Patch application logic
│       └── *.patch.py          # Individual patch files
└── tests/                      # Test suite
    ├── conftest.py             # Pytest fixtures
    ├── test_musa.py            # Platform tests
    └── test_patches.py         # Patch system tests

Runtime Patches

The plugin includes runtime patches to ensure compatibility with upstream vLLM. For details on the patching mechanism, see patches/README.md.
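
As an illustration of the general pattern only (not the plugin's actual implementation), a runtime patch typically swaps an upstream attribute for a replacement while keeping a handle to the original:

```python
# Illustrative sketch of the runtime-patch pattern: replace an attribute on
# an upstream module at import time, stashing the original for transparency.
import types


def apply_patch(module, attr, replacement):
    """Swap module.attr for replacement, keeping the original as _orig_<attr>."""
    original = getattr(module, attr)
    setattr(module, f"_orig_{attr}", original)
    setattr(module, attr, replacement)
    return original


# Demo on a stand-in "upstream" module:
upstream = types.SimpleNamespace(greet=lambda: "upstream")
apply_patch(upstream, "greet", lambda: "patched")
print(upstream.greet())        # patched
print(upstream._orig_greet())  # upstream
```

Keeping the original attribute around makes the patch easy to audit and to revert in tests.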

Contributing

We welcome contributions and collaboration. Please set up pre-commit hooks to ensure code quality before submitting:

# Install pre-commit
pip install pre-commit

# Install the git hooks
pre-commit install

# (Optional) Run against all files
pre-commit run --all-files

Once installed, the hooks will automatically run on every commit, checking for:

  • Trailing whitespace and file formatting
  • Import sorting (isort)
  • Code formatting (black)
  • Linting (ruff)
  • Spelling errors (codespell)
  • Common issues (merge conflicts, debug statements, large files, etc.)

You can also run checks manually:

make pre-commit    # Run pre-commit hooks on all files
make test          # Run tests
make test-cov      # Run tests with coverage

Contact Us

  • For technical questions and feature requests, please use GitHub Issues.
  • When reporting a bug, please include your environment information by running vllm_collect_env (or python -m vllm_musa.collect_env) and pasting the output in your issue.

Related Projects

| Project        | Description                                |
| -------------- | ------------------------------------------ |
| vLLM           | High-throughput LLM serving engine         |
| torchada       | CUDA→MUSA compatibility layer for PyTorch  |
| torch_musa     | PyTorch support for Moore Threads GPUs     |
| MATE           | MUSA AI Tensor Engine for LLM acceleration |
| mthreads-ml-py | MTML Python bindings                       |

License

This project is licensed under the Apache License 2.0.

Copyright (c) 2026 Moore Threads Technology Co., Ltd. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
