Performance gap when enabling tuning_mm_mlp (src vs. adding VGGT) #42

@Terry-Xu-666

Description

Hi VGLLM team, thanks a lot for the excellent work!

I’m currently experimenting with introducing VGGT into another base model. My setup:
• Base model: nvila-8b
• Data: VLM3R’s vsi dataset only
• Settings:
  • On the source model, I enabled tuning_mm_mlp (i.e., the MM projector in the code) and LoRA on the LMM backbone.
  • On the VGGT-integrated model, I enabled tuning_mm_mlp, the fusion module, and LoRA on the backbone.
  • I have tried multiple fusion strategies.
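To make the comparison concrete, the two training setups above can be sketched as config dicts. The flag names here (`tune_mm_mlp`, `lora_llm`, `use_vggt`, `tune_fusion`) are illustrative placeholders, not the repo's actual argument names:

```python
# Illustrative sketch of the two setups being compared.
# All flag names are hypothetical; the real codebase may name them differently.

baseline_cfg = {
    "model": "nvila-8b",
    "tune_mm_mlp": True,   # train the multimodal projector
    "lora_llm": True,      # LoRA adapters on the backbone
    "use_vggt": False,
}

vggt_cfg = {
    "model": "nvila-8b",
    "tune_mm_mlp": True,
    "lora_llm": True,
    "use_vggt": True,      # VGGT branch enabled
    "tune_fusion": True,   # fusion module trained jointly
}

# The only intended differences are the VGGT branch and its fusion module;
# everything else (projector tuning, LoRA, data) is held fixed.
diff = sorted(k for k in vggt_cfg if baseline_cfg.get(k) != vggt_cfg[k])
print(diff)  # ['tune_fusion', 'use_vggt']
```

Isolating the diff like this is just a sanity check that the observed gap can only come from the VGGT branch and fusion module, not from an accidental config drift.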

Observation:
Under these settings, the model with VGGT consistently underperforms the source model, by roughly 1–7 points depending on the metric.

Question:
Have you tried enabling tuning_mm_mlp in your experiments? Would this observation imply that fine-tuning the vision encoder might yield better results than introducing VGGT?

Any insights into this phenomenon would be greatly appreciated.

Next steps:
I’m currently also running experiments where the projector is frozen and strictly aligned with the same fusion module (as in your paper’s setting), and I will update the results once they are ready.

Thanks in advance for the community’s help!
