Welcome to PhysTool-Bench! 👋
PhysTool-Bench is a benchmark that evaluates how well MLLMs perceive, select, and sequence physical tools in real-world scenes.
It consists of 2 tasks that separate visual recognition from functional planning:
- Task I – Tool Recognition: List all visible tools in a cluttered scene (image only).
- Task II – Tool Selection & Planning: Given a real-scene image and a brief task instruction, output the ordered sequence of required tools.
| Key Numbers | Value |
|---|---|
| Total queries (image+task+answers) | 2,510 |
| Unique physical tools | 2,678 |
| Tools per scene | 8.6 (3.1 required, 5.5 distractors) |
- Two‑Task Design: Decouples recognition (all visible tools) from planning (select + order).
- Real‑World Tool Variety: Across 57 categories (manufacturing, healthcare, farming, etc.).
- Challenging Distractors: 3–10 visually/functionally similar decoys per scene.
- Rich Evaluation Metrics: EM, TCR, SR@k, and fine‑grained error analysis.
Since different models require conflicting versions of transformers and other libraries, we provide separate Conda environments for running different model families. Choose the one that matches the model you want to evaluate.
If you prefer using models via API (e.g., GPT-4o), you can skip the environment setup and directly run the inference scripts with your API key.
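For API-based runs, it can be convenient to keep the key out of your shell history. The sketch below builds the same command line as the API inference examples in this README (script path and the `-t`, `-m`, `-i`, `-o`, `--delay` flags are taken from those examples). The environment variable name `PHYSTOOL_API_KEY` and the helper `build_api_command` are hypothetical and are not part of the repository.

```python
from __future__ import annotations

import os
import subprocess
import sys


def build_api_command(task: str, model: str, api_key: str, delay: int = 1) -> list[str]:
    """Build the argv for the API inference script of the given task ("i" or "ii")."""
    if task not in {"i", "ii"}:
        raise ValueError("task must be 'i' or 'ii'")
    return [
        sys.executable,
        f"scripts/inference/task_{task}_api.py",
        "-t", api_key,
        "-m", model,
        "-i", "data/generation_checkpoint.json",
        "-o", "results",
        "--delay", str(delay),
    ]


if __name__ == "__main__":
    # PHYSTOOL_API_KEY is a hypothetical variable name chosen for this sketch.
    key = os.environ.get("PHYSTOOL_API_KEY")
    if key:
        subprocess.run(build_api_command("i", "gpt-4o", key), check=True)
    else:
        print("Set PHYSTOOL_API_KEY before running this sketch.")
```

Passing the arguments as a list avoids shell quoting issues if a model name or key contains special characters.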
For Open-Flamingo
conda create -n flamingo_env python=3.10 -y
eval "$(conda shell.bash hook)" && conda activate flamingo_env
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.28.1
pip install open-flamingo==2.0.1 --no-deps
pip install einops einops-exts open_clip_torch huggingface-hub Pillow accelerate sentencepiece
For mPLUG-Owl3
conda create -n mPLUG_env python=3.10 -y
eval "$(conda shell.bash hook)" && conda activate mPLUG_env
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.40.2
pip install icecream einops 'accelerate>=0.26.0' pillow
For MiniCPM
conda create -n minicpm_env python=3.10 -y
eval "$(conda shell.bash hook)" && conda activate minicpm_env
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install "transformers<5.0.0" "timm>=1.0.0" "accelerate>=1.0.0"
pip install sentencepiece pillow decord einops minicpmo hyperpyyaml speechbrain librosa onnx onnxruntime-gpu
- GPU: Recommended for training and inference (CUDA-compatible)
- Python: 3.10
- CUDA: 11.8 / 12.1 / 12.4 (depending on the environment above)
- Frameworks: PyTorch (2.4.0 for Open-Flamingo, 2.7.0 for mPLUG-Owl3), Transformers (version varies per model), Accelerate
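Because each environment pins different versions, a quick check after installation can catch a mismatched environment before a long inference run. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_pins` are hypothetical, and the pins are copied from the Open-Flamingo environment above.

```python
from __future__ import annotations

from importlib import metadata


def installed_version(package: str) -> str | None:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins: dict[str, str]) -> dict[str, tuple[str, str | None]]:
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        found = installed_version(package)
        # Local build tags such as "2.4.0+cu121" still count as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


# Pins copied from the Open-Flamingo environment above.
FLAMINGO_PINS = {"torch": "2.4.0", "torchvision": "0.19.0", "transformers": "4.28.1"}

if __name__ == "__main__":
    for package, (expected, found) in check_pins(FLAMINGO_PINS).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

Run it inside the activated environment; no output means every pinned package matches.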
chmod +x scripts/download_data.sh
./scripts/download_data.sh
or download manually from:
Through the API:
python scripts/inference/task_i_api.py -t YOUR_API_KEY -m <model_name> -i data/generation_checkpoint.json -o results --delay 1
# For example:
python scripts/inference/task_i_api.py -t sk-xxxx -m gpt-4o -i data/generation_checkpoint.json --delay 1
For local models:
python scripts/inference/task_i_<model_name>.py
# For example:
python scripts/inference/task_i_Openflamingo9B.py
Through the API:
python scripts/inference/task_ii_api.py -t YOUR_API_KEY -m <model_name> -i data/generation_checkpoint.json -o results --delay 2
# For example:
python scripts/inference/task_ii_api.py -t sk-xxxx -m gpt-4o -i data/generation_checkpoint.json -o results --delay 2
For local models:
python scripts/inference/task_ii_<model_name>.py
# For example:
python scripts/inference/task_ii_Openflamingo9B.py
# Evaluate Task I. --predictions is the predicted tool lists from Task I inference;
# --output-json is where evaluation results will be saved; --match-method is fuzzy or strict.
python scripts/evaluation/eval_tool_finding.py \
--model <model_name> \
--ground-truth data/corrected_tools.json \
--predictions results/all_tools_identified_<model_name>.json \
--output-json results/eval_tool_finding_<model_name>.json \
--match-method {fuzzy|strict}
# For example:
python scripts/evaluation/eval_tool_finding.py \
--model gpt-4o \
--ground-truth data/corrected_tools.json \
--predictions results/all_tools_identified_gpt-4o.json \
--output-json results/eval_tool_finding_gpt-4o.json \
--match-method fuzzy
Use Gemini as a Judge (by API):
python scripts/evaluation/eval_gemini.py -t YOUR_API_KEY -m <model_name> \
-r results/task_ii_results_<model_name>.json \
-o results/evaluation_of_<model_name>_with_gemini.json \
-k 1,2,3
# For example:
python scripts/evaluation/eval_gemini.py -t sk-xxxx -m MiniCPM \
-r results/task_ii_results_MiniCPM.json \
-o results/evaluation_of_MiniCPM_with_gemini.json \
-k 1,2,3
Use existing matching pairs:
python scripts/evaluation/eval_offline.py -m <model_name> \
-r results/task_ii_results_<model_name>.json \
-o results/evaluation_of_<model_name>_with_gemini.json \
-k 1,2,3
# For example:
python scripts/evaluation/eval_offline.py -m MiniCPM \
-r results/task_ii_results_MiniCPM.json \
-o results/evaluation_of_MiniCPM_with_gemini.json \
-k 1,2,3
PhysTool-Bench is built through controlled expansion and iterative refinement: starting from a seed set of tools, growing the tool bank iteratively, and verifying every step.
- Start with 310 manually curated tools, then iteratively expand.
- Recycle novel distractors generated during query creation back into the bank.
- → Covers 2,678 tools across 57 categories, avoids artificial “tool spotting”, and ensures broad + balanced coverage.
- Distractors: 3–10 per scene, visually or functionally similar to targets. 86.9% of tasks require a strict order.
- Three QC stages (LLM necessity audit → programmatic alignment → human visual review) to remove ambiguity, artificial cues, or physically unrealistic images.
- → Every query has a clear, verifiable ground truth (humans reach 75% EM on familiar tasks).
- Task I (Recognition) – image only → list all visible tools. Measures pure visual enumeration.
- Task II (Planning) – image + instruction → ordered required tools. Measures functional mapping + sequencing.
- If a model sees correctly (Task I) but plans poorly (Task II), the bottleneck is physical commonsense, not vision.
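To make the recognition/planning split concrete, here is a minimal sketch of how the two tasks could be scored. It is not the benchmark's evaluation code: the function names, the whitespace/case normalization, the use of `difflib.SequenceMatcher`, and the 0.8 fuzzy threshold are all assumptions made for illustration.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so 'Claw  Hammer' equals 'claw hammer'."""
    return " ".join(name.lower().split())


def names_match(pred: str, gold: str, method: str = "strict", threshold: float = 0.8) -> bool:
    """Strict: equal after normalization. Fuzzy: similarity ratio >= threshold (assumed)."""
    p, g = normalize(pred), normalize(gold)
    if method == "strict":
        return p == g
    return SequenceMatcher(None, p, g).ratio() >= threshold


def recognition_scores(
    predicted: list[str], gold: list[str], method: str = "strict"
) -> tuple[float, float]:
    """Task I style: order-free precision and recall over tool names."""
    unmatched = list(gold)
    hits = 0
    for pred in predicted:
        for g in unmatched:
            if names_match(pred, g, method):
                unmatched.remove(g)  # each gold tool can be matched only once
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def exact_match(predicted: list[str], gold: list[str]) -> bool:
    """Task II style: the same tools in the same order."""
    return [normalize(p) for p in predicted] == [normalize(g) for g in gold]
```

Under this sketch, a model that lists every visible tool (high Task I recall) but returns the required tools in the wrong order fails `exact_match`, which is the "sees correctly but plans poorly" case described above.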
We would like to thank the contributors, open-source projects, and research communities whose work made PhysTool-Bench possible. This project builds upon ideas, tools, and datasets developed by the broader machine learning and information retrieval ecosystem.
- 🖼️ Image Generation – Nano Banana Pro (synthetic scene rendering)
- 🧠 Open‑weight Models
- 💻 Code & Libraries – 🤗 Transformers, vLLM, PyTorch, PIL, requests
- 📚 Dataset & Classification – UNSPSC, manual annotation & QC team
- 📊 Inference & Evaluation – vLLM, custom evaluation scripts (offline, Gemini‑based, fuzzy matching)
This project is licensed under the MIT License. Please refer to the LICENSE file for full details.
If you use PhysTool-Bench in your research or applications, please consider citing:
@article{PhysTool-Bench2026,
title = {Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use},
author = {Zhixin Ma and Yutong Zhou and Yongqi Li and Chong-Wah Ngo and Wenjie Li},
journal = {arXiv preprint arXiv:2606.10803},
year = {2026}
}

