ModalityDance/PhysTool-Bench

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Welcome to PhysTool-Bench! 👋

PhysTool-Bench is a benchmark that evaluates how well MLLMs perceive, select, and sequence physical tools in real-world scenes.

It consists of 2 tasks that separate visual recognition from functional planning:

  • Task I – Tool Recognition: List all visible tools in a cluttered scene (image only).
  • Task II – Tool Selection & Planning: Given a real-world scene image + a brief task instruction, output the ordered sequence of required tools.

📊 Dataset at a Glance

| Key Numbers | Value |
| --- | --- |
| Total queries (image + task + answers) | 2,510 |
| Unique physical tools | 2,678 |
| Tools per scene | 8.6 (3.1 required, 5.5 distractors) |

🪐 Key Features

  • Two‑Task Design: Decouples recognition (all visible tools) from planning (select + order).
  • Real‑World Tool Variety: Across 57 categories (manufacturing, healthcare, farming, etc.).
  • Challenging Distractors: 3–10 visually/functionally similar decoys per scene.
  • Rich Evaluation Metrics: EM, TCR, SR@k, and fine‑grained error analysis.
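
Of the metrics listed above, exact match (EM) is the easiest to illustrate. The sketch below assumes EM means the predicted tool sequence equals the reference sequence, in order, after lowercasing and whitespace normalization. That normalization and the function names are assumptions made for illustration, not the benchmark's scoring code. TCR and SR@k are left out because they are not defined in this document.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so 'Claw  Hammer' equals 'claw hammer'."""
    return " ".join(name.lower().split())


def exact_match(predicted: list[str], reference: list[str]) -> bool:
    """True only if both sequences list the same tools in the same order."""
    return [normalize(p) for p in predicted] == [normalize(r) for r in reference]


def em_score(predictions: list[list[str]], references: list[list[str]]) -> float:
    """Fraction of queries whose predicted sequence exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Because order matters, a prediction that contains the right tools in the wrong sequence counts as a miss under this reading.
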
Overview
Quick Overview of PhysTool-Bench.

🚀 Quick Start

1. Installation

Since different models require conflicting versions of transformers and other libraries, we provide separate Conda environments for running different model families. Choose the one that matches the model you want to evaluate.

If you prefer using models via API (e.g., GPT-4o), you can skip the environment setup and directly run the inference scripts with your API key.

Conda (recommended)

For Open-Flamingo

conda create -n flamingo_env python=3.10 -y
eval "$(conda shell.bash hook)" && conda activate flamingo_env

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.28.1
pip install open-flamingo==2.0.1 --no-deps
pip install einops einops-exts open_clip_torch huggingface-hub Pillow accelerate sentencepiece

For mPLUG-Owl3

conda create -n mPLUG_env python=3.10 -y
eval "$(conda shell.bash hook)" && conda activate mPLUG_env

pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.40.2
pip install icecream einops 'accelerate>=0.26.0' pillow

For MiniCPM

conda create -n minicpm_env python=3.10 -y
eval "$(conda shell.bash hook)" && conda activate minicpm_env

pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install "transformers<5.0.0" "timm>=1.0.0" "accelerate>=1.0.0"
pip install sentencepiece pillow decord einops minicpmo hyperpyyaml speechbrain librosa onnx onnxruntime-gpu

Hardware Requirements

  • GPU: CUDA-compatible GPU, recommended for local-model inference
  • Python: 3.10
  • CUDA: 11.8 / 12.1 / 12.4, depending on the environment (see the wheel index URLs above)
  • Frameworks: PyTorch and Transformers (versions vary per environment, as pinned above), Accelerate
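
Since each environment pins different library versions, it can help to confirm what is actually installed before running inference. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the repository.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions


if __name__ == "__main__":
    # Packages named in the install commands above.
    for pkg, ver in installed_versions(["torch", "transformers", "accelerate"]).items():
        print(f"{pkg}: {ver or 'not installed'}")
```

Run it inside the activated Conda environment and compare the output against the versions pinned for that model family.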

2. Download datasets

chmod +x scripts/download_data.sh
./scripts/download_data.sh

or download manually from:

3. Inference

Task I Inference

Through the API:

python scripts/inference/task_i_api.py -t YOUR_API_KEY -m <model_name> -i data/generation_checkpoint.json -o results --delay 1

# For example:
python scripts/inference/task_i_api.py -t sk-xxxx -m gpt-4o -i data/generation_checkpoint.json -o results --delay 1

For local models:

python scripts/inference/task_i_<model_name>.py

# For example:
python scripts/inference/task_i_Openflamingo9B.py

Task II Inference

Through the API:

python scripts/inference/task_ii_api.py -t YOUR_API_KEY -m <model_name> -i data/generation_checkpoint.json -o results --delay 2

# For example:
python scripts/inference/task_ii_api.py -t sk-xxxx -m gpt-4o -i data/generation_checkpoint.json -o results --delay 2

For local models:

python scripts/inference/task_ii_<model_name>.py

# For example:
python scripts/inference/task_ii_Openflamingo9B.py
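
Typing an API key directly on the command line leaves it in shell history. As a hedged alternative, the sketch below builds the same command from Python and reads the key from an environment variable. The flags (`-t`, `-m`, `-i`, `-o`, `--delay`) and script paths are taken from the commands above; the variable name `PHYSTOOL_API_KEY` and both helper functions are hypothetical.

```python
import os
import subprocess
import sys


def build_api_command(task, model, api_key,
                      input_path="data/generation_checkpoint.json",
                      output_dir="results", delay=1):
    """Return the argv list for the Task I or Task II API inference script."""
    if task not in ("i", "ii"):
        raise ValueError("task must be 'i' or 'ii'")
    return [
        sys.executable, f"scripts/inference/task_{task}_api.py",
        "-t", api_key,
        "-m", model,
        "-i", input_path,
        "-o", output_dir,
        "--delay", str(delay),
    ]


def run_from_env(task, model, delay=1, env_var="PHYSTOOL_API_KEY"):
    """Run inference with the key read from an environment variable (assumed name)."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before running inference")
    subprocess.run(build_api_command(task, model, api_key, delay=delay), check=True)
```

For instance, `run_from_env("ii", "gpt-4o", delay=2)` would launch the Task II example from the repository root.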

4. Evaluation

Task I

# --predictions:  predicted tool lists from Task I inference
# --output-json:  where evaluation results will be saved
# --match-method: either fuzzy or strict
python scripts/evaluation/eval_tool_finding.py \
    --model <model_name> \
    --ground-truth data/corrected_tools.json \
    --predictions results/all_tools_identified_<model_name>.json \
    --output-json results/eval_tool_finding_<model_name>.json \
    --match-method {fuzzy|strict}

# For example:
python scripts/evaluation/eval_tool_finding.py \
    --model gpt-4o \
    --ground-truth data/corrected_tools.json \
    --predictions results/all_tools_identified_gpt-4o.json \
    --output-json results/eval_tool_finding_gpt-4o.json \
    --match-method fuzzy
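
The `--match-method` flag accepts `fuzzy` or `strict`, but how each mode compares tool names is not described here. The sketch below shows one plausible interpretation: `strict` as normalized string equality and `fuzzy` as a `difflib` similarity ratio above a threshold. The 0.8 threshold, the greedy one-to-one matching, and all function names are assumptions, not the behavior of `eval_tool_finding.py`.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace before comparing tool names."""
    return " ".join(name.lower().split())


def names_match(pred, truth, method="strict", threshold=0.8):
    """Compare two tool names under the assumed strict or fuzzy rule."""
    p, t = normalize(pred), normalize(truth)
    if method == "strict":
        return p == t
    if method == "fuzzy":
        return SequenceMatcher(None, p, t).ratio() >= threshold
    raise ValueError("method must be 'strict' or 'fuzzy'")


def precision_recall(predicted, ground_truth, method="strict"):
    """Greedy one-to-one matching: each ground-truth tool can be claimed once."""
    remaining = list(ground_truth)
    matched = 0
    for pred in predicted:
        for truth in remaining:
            if names_match(pred, truth, method):
                remaining.remove(truth)
                matched += 1
                break
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Under these assumptions, "screwdrivers" matches "screwdriver" in fuzzy mode but not in strict mode, which is the kind of gap the two modes are presumably meant to expose.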

Task II

Use Gemini as a judge (via API):

python scripts/evaluation/eval_gemini.py -t YOUR_API_KEY -m <model_name> \
    -r results/task_ii_results_<model_name>.json \
    -o results/evaluation_of_<model_name>_with_gemini.json \
    -k 1,2,3

# For example:
python scripts/evaluation/eval_gemini.py -t sk-xxxx -m MiniCPM \
    -r results/task_ii_results_MiniCPM.json \
    -o results/evaluation_of_MiniCPM_with_gemini.json \
    -k 1,2,3

Use existing matching pairs (offline):

python scripts/evaluation/eval_offline.py -m <model_name> \
    -r results/task_ii_results_<model_name>.json \
    -o results/evaluation_of_<model_name>_with_gemini.json \
    -k 1,2,3

# For example:
python scripts/evaluation/eval_offline.py -m MiniCPM \
    -r results/task_ii_results_MiniCPM.json \
    -o results/evaluation_of_MiniCPM_with_gemini.json \
    -k 1,2,3

✨ How It Works

PhysTool-Bench is built through controlled expansion and iterative refinement: it starts from a seed set of tools, grows the tool bank round by round, and verifies every step.

Pipeline
Three-stage construction (left) and two-task evaluation (right).

🏗️ 1. Tool Bank: Grow from Seeds, Not Everything at Once

  • Start with 310 manually curated tools, then iteratively expand.
  • Recycle novel distractors generated during query creation back into the bank.
  • → Covers 2,678 tools across 57 categories, avoids artificial “tool spotting”, and ensures broad + balanced coverage.
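
The seed-and-recycle idea above can be sketched as a simple set expansion. This is an illustration only: the function name and the input shape (rounds of queries, each carrying a list of generated distractor names) are assumptions, and the real pipeline presumably involves generation and filtering steps not shown here.

```python
def expand_tool_bank(seed_tools, query_rounds):
    """Grow a tool bank by recycling unseen distractors from each round.

    query_rounds is an iterable of rounds; each round is a list of queries,
    and each query is represented by its list of distractor tool names.
    Returns the final bank and the bank size after each round.
    """
    bank = set(seed_tools)
    sizes = [len(bank)]
    for round_queries in query_rounds:
        for distractors in round_queries:
            # Set union keeps existing tools and adds only novel names.
            bank.update(distractors)
        sizes.append(len(bank))
    return bank, sizes
```

Tracking the size after each round makes it easy to see when expansion saturates, that is, when new queries stop contributing novel tools.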

🔍 2. Query Generation + QC: Relentless Refinement

  • Distractors: 3–10 per scene, visually or functionally similar to targets. 86.9% of tasks require a strict tool order.
  • Three QC stages (LLM necessity audit → programmatic alignment → human visual review) to remove ambiguity, artificial cues, or physically unrealistic images.
  • → Every query has a clear, verifiable ground truth (humans reach 75% EM on familiar tasks).
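
What the programmatic alignment stage checks is not spelled out here. One plausible reading is a set of consistency checks between a query's answer and its scene, sketched below. The field names (`required_tools`, `distractors`, `scene_tools`) are hypothetical and do not reflect the dataset's actual schema; only the 3–10 distractor range comes from the text above.

```python
def alignment_issues(query):
    """Return a list of human-readable problems found in one query record."""
    required = query["required_tools"]
    distractors = query["distractors"]
    scene = set(query["scene_tools"])
    issues = []

    # Every required tool should actually be present in the scene.
    missing = [tool for tool in required if tool not in scene]
    if missing:
        issues.append(f"required tools missing from scene: {missing}")

    # A tool cannot be both part of the answer and a distractor.
    overlap = sorted(set(required) & set(distractors))
    if overlap:
        issues.append(f"tools listed as both required and distractor: {overlap}")

    # The stated design calls for 3 to 10 distractors per scene.
    if not 3 <= len(distractors) <= 10:
        issues.append(f"distractor count out of range: {len(distractors)}")

    return issues
```

A query that returns an empty list passes these checks; anything else would be flagged for regeneration or human review.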

🧪 3. Two‑Task Evaluation: Pinpoint Where Models Fail

  • Task I (Recognition) – image only → list all visible tools. Measures pure visual enumeration.
  • Task II (Planning) – image + instruction → ordered required tools. Measures functional mapping + sequencing.
  • If a model sees correctly (Task I) but plans poorly (Task II), the bottleneck is physical commonsense, not vision.
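
The diagnostic logic in the last bullet can be sketched as a per-query classification. This is a simplified illustration: it assumes each query has been reduced to two booleans (whether Task I recovered all required tools, and whether the Task II plan was correct), and the labels and function names are hypothetical.

```python
from collections import Counter


def diagnose(recognized_all: bool, plan_correct: bool) -> str:
    """Attribute a Task II failure to planning or perception."""
    if plan_correct:
        return "ok"
    # The model saw the tools but still planned wrongly: planning bottleneck.
    # Otherwise the failure may already originate in perception.
    return "planning" if recognized_all else "perception"


def summarize(results):
    """Count diagnoses over (recognized_all, plan_correct) pairs."""
    return Counter(diagnose(seen, planned) for seen, planned in results)
```

A summary dominated by "planning" would point to physical commonsense, rather than vision, as the limiting factor.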

🌱 Acknowledgements

We would like to thank the contributors, open-source projects, and research communities whose work made PhysTool-Bench possible. This project builds upon ideas, tools, and datasets developed by the broader machine learning and information retrieval ecosystem.

This project is licensed under the MIT License. Please refer to the LICENSE file for full details.

📚 Citation

If you use PhysTool-Bench in your research or applications, please consider citing:

@article{PhysTool-Bench2026,
  title        = {Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use},
  author       = {Zhixin Ma and Yutong Zhou and Yongqi Li and Chong-Wah Ngo and Wenjie Li},
  journal      = {arXiv preprint arXiv:2606.10803},
  year         = {2026}
}

Thank you for visiting PhysTool-Bench!
