Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Welcome to PhysTool-Bench! 👋

PhysTool-Bench is a benchmark that evaluates how well MLLMs perceive, select, and sequence physical tools in real-world scenes.

It consists of 2 tasks that separate visual recognition from functional planning:

Task I – Tool Recognition: List all visible tools in a cluttered scene (image only).
Task II – Tool Selection & Planning: Given a real-scenario image and a brief task instruction, output the ordered sequence of required tools.

📊 Dataset at a Glance

Key Numbers	Value
Total queries (image+task+answers)	2,510
Unique physical tools	2,678
Tools per scene	8.6 (3.1 required, 5.5 distractors)

🪐 Key Features

Two‑Task Design: Decouples recognition (all visible tools) from planning (select + order).
Real‑World Tool Variety: Across 57 categories (manufacturing, healthcare, farming, etc.).
Challenging Distractors: 3–10 visually/functionally similar decoys per scene.
Rich Evaluation Metrics: EM, TCR, SR@k, and fine‑grained error analysis.

Quick Overview of PhysTool-Bench.

🚀 Quick Start

1. Installation

Since different models require conflicting versions of transformers and other libraries, we provide separate Conda environments for running different model families. Choose the one that matches the model you want to evaluate.

If you prefer using models via API (e.g., GPT-4o), you can skip the environment setup and directly run the inference scripts with your API key.

Conda (recommended)

For Open-Flamingo

conda create -n flamingo_env python=3.10 -y
eval "$(conda shell.bash hook)" && conda activate flamingo_env

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.28.1
pip install open-flamingo==2.0.1 --no-deps
pip install einops einops-exts open_clip_torch huggingface-hub Pillow accelerate sentencepiece

For mPLUG-Owl3

conda create -n mPLUG_env python=3.10 -y
eval "$(conda shell.bash hook)" && conda activate mPLUG_env

pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.40.2
pip install icecream einops 'accelerate>=0.26.0' pillow

For MiniCPM

conda create -n minicpm_env python=3.10 -y
eval "$(conda shell.bash hook)" && conda activate minicpm_env

pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install "transformers<5.0.0" "timm>=1.0.0" "accelerate>=1.0.0"
pip install sentencepiece pillow decord einops minicpmo hyperpyyaml speechbrain librosa onnx onnxruntime-gpu
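After activating an environment, it can help to confirm that the pinned packages actually resolved to the expected versions. The following is a minimal sketch that uses only the standard-library `importlib.metadata`; the `EXPECTED_PINS` table and the `check_env` helper are hypothetical (not part of the repository) and only copy the exact `==` pins from the commands above.

```python
from importlib import metadata

# Exact pins copied from the install commands above. The table itself is a
# hypothetical helper for illustration, not something the repository ships.
EXPECTED_PINS = {
    "flamingo_env": {"torch": "2.4.0", "transformers": "4.28.1", "open-flamingo": "2.0.1"},
    "mPLUG_env": {"torch": "2.7.0", "transformers": "4.40.2"},
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_env(env_name, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in EXPECTED_PINS[env_name].items():
        found = lookup(package)
        # Ignore local build tags such as "+cu121" when comparing versions.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `check_env("flamingo_env")` inside the activated environment returns an empty dict when every pinned package matches.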

Hardware Requirements

GPU: Recommended for training and inference (CUDA-compatible)
Python: 3.10
CUDA: 11.8 / 12.1 / 12.4 (depending on the environment chosen above)
Frameworks: PyTorch (2.4.0 for Open-Flamingo, 2.7.0 for mPLUG-Owl3, latest for MiniCPM), Transformers (version varies per model), Accelerate

2. Download datasets

chmod +x scripts/download_data.sh
./scripts/download_data.sh

or download manually from:

https://huggingface.co/datasets/ModalityDance/PhysTool-Bench
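The manual download can also be scripted from Python. This is a sketch under assumptions: it relies on `snapshot_download` from the `huggingface_hub` package, and it assumes that `data` is the right target directory (inferred from the paths used by the inference commands, not confirmed by the repository). What `scripts/download_data.sh` does internally may differ.

```python
from pathlib import Path

REPO_ID = "ModalityDance/PhysTool-Bench"  # taken from the dataset URL above


def download_dataset(local_dir="data"):
    """Download a snapshot of the dataset repository and return its local path.

    The default target directory is an assumption based on the data/ paths
    used by the inference commands.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=str(target))
```

Run `download_dataset()` once before inference; repeated calls reuse files that are already present.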

3. Inference

Task I Inference

Through the API:

python scripts/inference/task_i_api.py -t YOUR_API_KEY -m <model_name> -i data/generation_checkpoint.json -o results --delay 1

# For example:
python scripts/inference/task_i_api.py -t sk-xxxx -m gpt-4o -i data/generation_checkpoint.json --delay 1

For local models:

python scripts/inference/task_i_<model_name>.py

# For example:
python scripts/inference/task_i_Openflamingo9B.py

Task II Inference

Through the API:

python scripts/inference/task_ii_api.py -t YOUR_API_KEY -m <model_name> -i data/generation_checkpoint.json -o results --delay 2

# For example:
python scripts/inference/task_ii_api.py -t sk-xxxx -m gpt-4o -i data/generation_checkpoint.json -o results --delay 2

For local models:

python scripts/inference/task_ii_<model_name>.py

# For example:
python scripts/inference/task_ii_Openflamingo9B.py
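When evaluating several API models, the two API commands can be generated rather than typed by hand. The sketch below only assembles the argument lists using the flags shown above (`-t`, `-m`, `-i`, `-o`, `--delay`); `build_api_command` is a hypothetical helper and does not exist in the repository.

```python
import shlex


def build_api_command(task, model, api_key="YOUR_API_KEY",
                      input_path="data/generation_checkpoint.json",
                      output_dir="results", delay=1):
    """Return the argv list for one API inference run (task is 'i' or 'ii')."""
    if task not in ("i", "ii"):
        raise ValueError("task must be 'i' or 'ii'")
    return [
        "python", f"scripts/inference/task_{task}_api.py",
        "-t", api_key,
        "-m", model,
        "-i", input_path,
        "-o", output_dir,
        "--delay", str(delay),
    ]


# Print one command per task, with the delays used in the examples above.
for task, delay in (("i", 1), ("ii", 2)):
    print(shlex.join(build_api_command(task, "gpt-4o", delay=delay)))
```

Each returned list can be passed directly to `subprocess.run(...)` to execute the run instead of printing it.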

4. Evaluation

Task I

# --predictions: predicted tool lists from Task I inference
# --output-json: where evaluation results will be saved
python scripts/evaluation/eval_tool_finding.py \
    --model <model_name> \
    --ground-truth data/corrected_tools.json \
    --predictions results/all_tools_identified_<model_name>.json \
    --output-json results/eval_tool_finding_<model_name>.json \
    --match-method {fuzzy|strict}

# For example:
python scripts/evaluation/eval_tool_finding.py \
    --model gpt-4o \
    --ground-truth data/corrected_tools.json \
    --predictions results/all_tools_identified_gpt-4o.json \
    --output-json results/eval_tool_finding_gpt-4o.json \
    --match-method fuzzy
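To illustrate the difference between the `strict` and `fuzzy` options, here is a rough sketch of how predicted tool names might be matched against ground truth. It is not the logic of `eval_tool_finding.py`: the use of `difflib.SequenceMatcher`, the 0.8 threshold, and the greedy one-to-one matching are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())


def is_match(pred, gold, method="fuzzy", threshold=0.8):
    """Strict: normalized names are equal. Fuzzy: similarity ratio >= threshold."""
    p, g = normalize(pred), normalize(gold)
    if method == "strict":
        return p == g
    return SequenceMatcher(None, p, g).ratio() >= threshold


def score(predicted, ground_truth, method="fuzzy"):
    """Greedy one-to-one matching, then precision, recall, and F1."""
    remaining = list(ground_truth)
    hits = 0
    for pred in predicted:
        for gold in remaining:
            if is_match(pred, gold, method):
                remaining.remove(gold)  # each gold tool can be matched only once
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Under these assumptions, a prediction such as "screwdrivers" counts as a hit for "screwdriver" in fuzzy mode but not in strict mode.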

Task II

Use Gemini as a Judge (by API):

python scripts/evaluation/eval_gemini.py -t YOUR_API_KEY -m <model_name> \
    -r results/task_ii_results_<model_name>.json \
    -o results/evaluation_of_<model_name>_with_gemini.json \
    -k 1,2,3

# For example:
python scripts/evaluation/eval_gemini.py -t sk-xxxx -m MiniCPM \
    -r results/task_ii_results_MiniCPM.json \
    -o results/evaluation_of_MiniCPM_with_gemini.json \
    -k 1,2,3

Use existing matching pairs:

python scripts/evaluation/eval_offline.py -m <model_name> \
    -r results/task_ii_results_<model_name>.json \
    -o results/evaluation_of_<model_name>_with_gemini.json \
    -k 1,2,3

# For example:
python scripts/evaluation/eval_offline.py -m MiniCPM \
    -r results/task_ii_results_MiniCPM.json \
    -o results/evaluation_of_MiniCPM_with_gemini.json \
    -k 1,2,3

✨ How It Works

PhysTool-Bench is built through controlled expansion and iterative refinement — starting from a seed set of tools, growing organically, and verifying every step.

Three-stage construction (left) and two-task evaluation (right).

🏗️ 1. Tool Bank: Grow from Seeds, Not Everything at Once

Start with 310 manually curated tools, then iteratively expand.
Recycle novel distractors generated during query creation back into the bank.
→ Covers 2,678 tools across 57 categories, avoids artificial “tool spotting”, and ensures broad + balanced coverage.
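The expansion loop described above can be sketched in a few lines. This is an illustrative simplification, not the authors' construction code: `expand_tool_bank` and the `"distractors"` field are hypothetical names, and real query creation involves generation and review steps that are omitted here.

```python
def expand_tool_bank(seed_tools, query_batches):
    """Grow a tool bank by recycling unseen distractors from each query batch.

    seed_tools: iterable of tool names (the manually curated seed set).
    query_batches: list of batches; each batch is a list of query dicts with
    a "distractors" list (hypothetical field name).
    """
    bank = set(seed_tools)
    for batch in query_batches:
        novel = set()
        for query in batch:
            novel.update(t for t in query["distractors"] if t not in bank)
        bank |= novel  # novel distractors become available to later rounds
    return bank
```

For example, a seed of `{"hammer", "wrench"}` followed by batches whose distractors include "mallet" and "chisel" would yield a four-tool bank.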

🔍 2. Query Generation + QC: Relentless Refinement

Distractors: 3–10 per scene, visually or functionally similar to targets. 86.9% of tasks require a strict order.
Three QC stages (LLM necessity audit → programmatic alignment → human visual review) to remove ambiguity, artificial cues, or physically unrealistic images.
→ Every query has a clear, verifiable ground truth (humans reach 75% EM on familiar tasks).
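The three QC stages can be pictured as an ordered chain of filters, where a query is dropped at the first stage it fails. The sketch below is only a schematic: the record fields (`llm_audit_passed`, `human_approved`, `scene_tools`, and so on) are hypothetical, the LLM audit and human review are represented by pre-recorded flags, and the alignment check is one plausible reading of "programmatic alignment".

```python
def alignment_ok(query):
    """Programmatic check: 3-10 distractors, disjoint from required tools,
    and every listed tool present in the scene's tool list."""
    required, distractors = set(query["required"]), set(query["distractors"])
    return (
        3 <= len(query["distractors"]) <= 10
        and required.isdisjoint(distractors)
        and (required | distractors) <= set(query["scene_tools"])
    )


# Stage order follows the text: LLM audit, programmatic alignment, human review.
STAGES = [
    ("llm_necessity_audit", lambda q: q.get("llm_audit_passed", False)),
    ("programmatic_alignment", alignment_ok),
    ("human_visual_review", lambda q: q.get("human_approved", False)),
]


def run_qc(queries, stages=STAGES):
    """Return (kept, rejected); rejected holds (query id, first failing stage)."""
    kept, rejected = [], []
    for query in queries:
        failed = next((name for name, check in stages if not check(query)), None)
        if failed is None:
            kept.append(query)
        else:
            rejected.append((query["id"], failed))
    return kept, rejected
```

Recording the first failing stage per query makes it easy to see which stage removes the most candidates.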

🧪 3. Two‑Task Evaluation: Pinpoint Where Models Fail

Task I (Recognition) – image only → list all visible tools. Measures pure visual enumeration.
Task II (Planning) – image + instruction → ordered required tools. Measures functional mapping + sequencing.
If a model sees correctly (Task I) but plans poorly (Task II), the bottleneck is physical commonsense, not vision.
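The diagnostic logic in the last point can be made concrete with a small per-query check. This is a simplified sketch, not the benchmark's scoring code: it reads EM as exact match on the ordered tool sequence, treats "sees correctly" as the Task I output covering every visible tool, and uses hypothetical function names.

```python
def exact_match(predicted, gold):
    """True when both sequences contain the same tools in the same order."""
    return [t.lower() for t in predicted] == [t.lower() for t in gold]


def diagnose(task1_pred, visible_tools, task2_pred, required_sequence):
    """Classify one query as 'ok', 'planning', or 'perception'.

    'planning'   : all visible tools were listed, but the plan is wrong.
    'perception' : the plan is wrong and some visible tool was missed.
    """
    sees_all = {t.lower() for t in visible_tools} <= {t.lower() for t in task1_pred}
    if exact_match(task2_pred, required_sequence):
        return "ok"
    return "planning" if sees_all else "perception"
```

Aggregating these labels over all queries gives a rough split between vision failures and physical-commonsense failures for a given model.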

🌱 Acknowledgements

We would like to thank the contributors, open-source projects, and research communities whose work made PhysTool-Bench possible. This project builds upon ideas, tools, and datasets developed by the broader machine learning and information retrieval ecosystem.

🖼️ Image Generation – Nano Banana Pro (synthetic scene rendering)
🧠 Open‑weight Models
💻 Code & Libraries – 🤗 Transformers, vLLM, PyTorch, PIL, requests
📚 Dataset & Classification – UNSPSC, manual annotation & QC team
📊 Inference & Evaluation – vLLM, custom evaluation scripts (offline, Gemini‑based, fuzzy matching)

This project is licensed under the MIT License. Please refer to the LICENSE file for full details.

📚 Citation

If you use PhysTool-Bench in your research or applications, please consider citing:

@article{PhysTool-Bench2026,
  title        = {Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use},
  author       = {Zhixin Ma and Yutong Zhou and Yongqi Li and Chong-Wah Ngo and Wenjie Li},
  journal      = {arXiv preprint arXiv:2606.10803},
  year         = {2026}
}

⭐ Thank you for visiting PhysTool-Bench! ⭐
