Performance gap when enabling tuning_mm_mlp (src vs. adding VGGT) #42

@Terry-Xu-666

Description

Hi VGLLM team, thanks a lot for the excellent work!

I’m currently experimenting with introducing VGGT into another base model. My setup:
• Base model: nvila-8b
• Data: VLM3R’s vsi dataset only
• Settings:
  • On the source model, I enabled tuning_mm_mlp (i.e., the MM projector in the code) and LoRA on the LMM backbone.
  • On the VGGT-integrated model, I enabled tuning_mm_mlp, the fusion module, and LoRA on the backbone.
  • I have tried multiple fusion strategies.
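To make the comparison concrete, the two training setups above can be sketched as config dicts. The flag names here (`tune_mm_mlp`, `lora_llm`, `use_vggt`, `tune_fusion`) are illustrative placeholders, not the repo's actual argument names:

```python
# Illustrative sketch of the two setups being compared.
# All flag names are hypothetical; the real codebase may name them differently.

baseline_cfg = {
    "model": "nvila-8b",
    "tune_mm_mlp": True,   # train the multimodal projector
    "lora_llm": True,      # LoRA adapters on the backbone
    "use_vggt": False,
}

vggt_cfg = {
    "model": "nvila-8b",
    "tune_mm_mlp": True,
    "lora_llm": True,
    "use_vggt": True,      # VGGT branch enabled
    "tune_fusion": True,   # fusion module trained jointly
}

# The only intended differences are the VGGT branch and its fusion module;
# everything else (projector tuning, LoRA, data) is held fixed.
diff = sorted(k for k in vggt_cfg if baseline_cfg.get(k) != vggt_cfg[k])
print(diff)  # ['tune_fusion', 'use_vggt']
```

Isolating the diff like this is just a sanity check that the observed gap can only come from the VGGT branch and fusion module, not from an accidental config drift.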

Observation:
Under these settings, the model with VGGT consistently underperforms the source model, by roughly 1–7 points depending on the metric.

Question:
Have you tried enabling tuning_mm_mlp in your experiments? Would this observation imply that fine-tuning the vision encoder might yield better results than introducing VGGT?

Any insights into this phenomenon would be greatly appreciated.

Next steps:
I’m currently also running experiments where the projector is frozen and strictly aligned with the same fusion module (as in your paper’s setting), and I will update the results once they are ready.

Thanks in advance for the community’s help!
