
Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?

📄 Paper   |   🌐 Website  

This is the official repository of our paper “Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?” by Jiahe Jin, Yanheng He, and Mingyan Yang.

Abstract

In this work, we identify the "2D-Cheating" problem in 3D LLM evaluation: many benchmark tasks can be easily solved by VLMs given rendered images of point clouds, which exposes how current evaluations fail to test the unique 3D capabilities of 3D LLMs. We test VLM performance across multiple 3D LLM benchmarks and, using the results as a reference, propose principles for better assessing genuine 3D understanding. We also advocate explicitly separating 3D abilities from 1D or 2D aspects when evaluating 3D LLMs.

Usage

We conducted experiments on the following benchmarks. The commands for reproducing each experiment are listed below:

3D MM-Vet

Generating Results

```bash
python ./src/object/3dmmvet/inference.py
```

Evaluating Results

```bash
python ./src/object/3dmmvet/eval.py
```

ObjaverseXL-LVIS Caption

Generating Results

```bash
python ./src/object/objaverseXL-LVIS_caption/vlm3d.py
```

Evaluating Results

```bash
python ./src/object/objaverseXL-LVIS_caption/evaluate.py
```
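The exact metric used by the evaluation script is not shown here; as an illustrative stand-in, caption quality can be scored with a simple token-overlap F1 between the predicted and reference captions (`caption_f1` is a hypothetical helper, not part of this repo):

```python
from collections import Counter

def caption_f1(pred: str, ref: str) -> float:
    """Token-level F1 overlap between a predicted and a reference caption."""
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count tokens that appear in both captions (with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(caption_f1("a red wooden chair", "a red chair made of wood"))
```

Real captioning benchmarks typically use stronger metrics (BLEU, CIDEr, or LLM-based judging); this sketch only conveys the reference-comparison structure of the evaluation.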

Rendering Scene Point Cloud

Render BEV images:

```bash
python ./src/scene/render/parallel_render_bev.py
```

Render multi-view images:

```bash
python ./src/scene/render/parallel_render_multi.py
```
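The repo's renderers are not reproduced here, but the core idea of a BEV (bird's-eye-view) render is to project the scene point cloud orthographically onto the ground plane. A minimal toy sketch (`render_bev` is hypothetical, not the repo's implementation, which renders textured images rather than occupancy maps):

```python
import numpy as np

def render_bev(points: np.ndarray, resolution: int = 64) -> np.ndarray:
    """Project an (N, 3) point cloud top-down onto an occupancy image.

    Drops the z (height) coordinate, rescales x/y into pixel coordinates,
    and marks each occupied cell.
    """
    xy = points[:, :2]
    mins, maxs = xy.min(axis=0), xy.max(axis=0)
    scale = (resolution - 1) / np.maximum(maxs - mins, 1e-8)
    pix = ((xy - mins) * scale).astype(int)
    image = np.zeros((resolution, resolution), dtype=np.uint8)
    image[pix[:, 1], pix[:, 0]] = 255  # mark occupied cells
    return image

cloud = np.random.rand(1000, 3)  # synthetic point cloud
bev = render_bev(cloud)
print(bev.shape)
```

Multi-view rendering follows the same pattern but applies a rotation to the cloud (or moves a virtual camera) before projecting, producing one image per viewpoint.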

ScanQA

Generating Results

Generate BEV results:

```bash
python ./src/scene/evaluation/scanqa/generate/vlm3d.py
```

Generate multi-view results:

```bash
bash ./src/scene/evaluation/scanqa/generate/generate.sh
```

Evaluating Results

Evaluate single-view results:

```bash
python ./src/scene/evaluation/scanqa/evaluation/test.py
```

Evaluate HIS results:

```bash
python ./src/scene/evaluation/scanqa/evaluation/test_HIS.py
```

Evaluate BoN results:

```bash
python ./src/scene/evaluation/scanqa/evaluation/test_BoN.py
```
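Reading "BoN" as best-of-N over per-view answers (an assumption about what `test_BoN.py` computes, not confirmed by this README), the scoring idea can be sketched as an oracle that counts a question as correct if any of the N candidate answers matches the ground truth:

```python
def exact_match(pred: str, gt: str) -> float:
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(pred.strip().lower() == gt.strip().lower())

def best_of_n(candidates: list[str], gt: str) -> float:
    """Oracle best-of-N: take the best score over all candidate answers."""
    return max(exact_match(c, gt) for c in candidates)

# One answer per rendered view of the same scene (illustrative data).
answers = ["a sofa", "A chair", "a table"]
print(best_of_n(answers, "a chair"))
```

Such an oracle score gives an upper bound on what the VLM could achieve with ideal view selection, which is useful as a reference when comparing against 3D LLMs that see the full point cloud.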

SQA3D

To test the VLM's performance on SQA3D, run the following command:

```bash
python ./src/scene/evaluation/sqa3d/test_sqa_vlm.py
```
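SQA3D answers are short free-form strings, so scoring usually relies on exact match after normalization. A common SQuAD-style normalization (an illustrative sketch, not necessarily what `test_sqa_vlm.py` does):

```python
import re
import string

def normalize(ans: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    ans = ans.lower()
    ans = "".join(ch for ch in ans if ch not in string.punctuation)
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())

def exact_match(pred: str, gt: str) -> bool:
    return normalize(pred) == normalize(gt)

print(exact_match("The sofa.", "sofa"))
```

Normalizing both sides keeps the metric from penalizing trivial phrasing differences ("The sofa." vs. "sofa") that say nothing about 3D understanding.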

Acknowledgement

We would like to express our sincere gratitude to Prof. Yonglu Li for his valuable guidance and support throughout this research, from topic selection to the final writing. His insightful discussions and feedback have been essential to the completion of this work. We would also like to thank Ye Wang for kindly sharing the viewpoint dataset in ScanNet.

Data Attribution

This project uses data from:

Citation

If you find this work useful, please cite our paper:

@article{revisit3dllmbenchmark,
  author       = {Jiahe Jin and Yanheng He and Mingyan Yang},
  title        = {Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?},
  year         = {2025},
  journal      = {arXiv preprint arXiv:2502.08503},
  url          = {https://arxiv.org/abs/2502.08503}
}

About

[ACL 2025] Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?
