-
I need to correct a previous statement. It appears that the v2 LoRA was trained separately and is not directly derived from any single LoRA head in jina-embeddings-v3. Therefore, if we want to fully match the official implementation, it seems we should convert the v2 model itself.
-
One of the things that will be quite messy about this model is that it doesn't allow mixing text and vision. This is unlike traditional multimodal LLMs, where a projector "bridges" the text and vision parts. Ultimately, I don't believe that future models will follow this approach. In real-world scenarios, PDFs commonly contain both text and images, and it doesn't make sense to generate embeddings separately for each image or piece of text. The Jina researchers seem to have rushed the release of their model and skipped this use case altogether.
-
Jina CLIP v2 is a multimodal embedding model, not a multimodal generation model.
As stated in the official card:
jina-clip-v2 is a general-purpose multilingual multimodal embedding model for text & images.
The typical usage pattern is to compute text and image embeddings separately, and then compare them in vector space
(e.g., cosine similarity).
https://huggingface.co/jinaai/jina-clip-v2
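To make "compare them in vector space" concrete, here is a minimal sketch of cosine-similarity scoring over precomputed embeddings. It is plain NumPy; the 1024-dim shapes are just illustrative assumptions, not something mandated by the model.

```python
# Minimal sketch: compare precomputed text/image embeddings with cosine
# similarity. Plain NumPy; the 1024-dim shapes are illustrative assumptions.
import numpy as np

def cosine_scores(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one text vector and a batch of image vectors."""
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    return imgs @ t

# Stand-ins for embeddings produced by the text and vision encoders.
text_emb = np.random.rand(1024).astype(np.float32)
image_embs = np.random.rand(4, 1024).astype(np.float32)

scores = cosine_scores(text_emb, image_embs)
print(scores)                # one score per image
print(int(scores.argmax()))  # index of the best-matching image
```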
Jina CLIP v2 is a mixed model that contains both a text model and a vision model.
At inference time, the implementation loads the relevant parts depending on whether you call encode_text or
encode_image.
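For reference, this is roughly what that split looks like from the Hugging Face side. The encode_text / encode_image names come from the model's trust_remote_code implementation; the exact arguments and return types here are assumptions, so check the model card before relying on them.

```python
# Rough sketch of the encode_text / encode_image split on the HF side.
# Exact signatures live in the model's remote code; treat argument handling
# here as an approximation and "cat.jpg" as a placeholder path.
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

# Text path: only the text tower is exercised.
text_embs = model.encode_text(["a photo of a cat", "ein Foto einer Katze"])

# Image path: only the vision tower is exercised.
image_embs = model.encode_image(["cat.jpg"])

# Both outputs live in the same embedding space, so they can be compared
# with cosine similarity as in the sketch above.
```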
https://huggingface.co/jinaai/jina-embeddings-v3
CLIP v2 takes the v3 text model and activates one commonly used LoRA head.
In contrast, jina-embeddings-v3 itself ships with five LoRA heads.
CLIP v2 picks one of these tasks and bakes it into its mixed model.
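For comparison, this is roughly how the v3 LoRA heads are selected at encode time. The task parameter and the five task names below are my reading of the jina-embeddings-v3 model card, so treat them as assumptions and double-check against the card.

```python
# Rough sketch of selecting one of the five v3 LoRA heads at encode time.
# The task names are my reading of the jina-embeddings-v3 model card.
from transformers import AutoModel

v3 = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

# Each task activates a different LoRA adapter on the shared text backbone:
# "retrieval.query", "retrieval.passage", "separation", "classification",
# "text-matching".
query_emb = v3.encode(["how do I build llama.cpp?"], task="retrieval.query")
passage_emb = v3.encode(["llama.cpp builds with CMake."], task="retrieval.passage")

# CLIP v2, by contrast, ships a single text head inside its mixed
# text+vision checkpoint.
```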
———
Current state of this PR
The vision path is not wired into the mtmd CLI, because mtmd is designed around multimodal generation, not pure image embedding.
The text path reuses the existing embedding-only logic.
Images go through the vision encoder, and similarity is computed in embedding space.
This keeps the implementation embedding-only, instead of forcing CLIP v2 through the full mtmd multimodal generation flow.
———
Proposed final usage pattern
After considering the reviewers’ feedback, I think the most reasonable and llama.cpp‑friendly usage pattern is:
For the text side, convert jinaai/jina-embeddings-v3.
For the vision side, convert jina-clip-v2 and treat the result as a vision-only model (even though the source HF checkpoint is mixed text+vision).
Compute text and image embeddings separately and compare them in vector space, matching the official CLIP v2 usage pattern.
This guarantees compatibility with the official v2 model.
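As a rough sketch of the text side of that pattern, assuming a local llama.cpp checkout with convert_hf_to_gguf.py and a built llama-embedding binary, and assuming conversion support for this architecture is in place once the relevant changes land (file names and paths below are placeholders):

```python
# Rough sketch of the text side of the proposed pattern. Assumes a local
# llama.cpp checkout (convert_hf_to_gguf.py plus a built llama-embedding
# binary) and that conversion support for this architecture is available.
# All paths and file names are placeholders.
import subprocess
from huggingface_hub import snapshot_download

# 1. Download the v3 text model and convert it to GGUF.
model_dir = snapshot_download("jinaai/jina-embeddings-v3")
subprocess.run(
    ["python", "convert_hf_to_gguf.py", model_dir,
     "--outfile", "jina-embeddings-v3-f16.gguf", "--outtype", "f16"],
    check=True,
)

# 2. Compute a text embedding with the stock embedding example.
subprocess.run(
    ["./llama-embedding", "-m", "jina-embeddings-v3-f16.gguf",
     "-p", "a photo of a cat"],
    check=True,
)

# 3. The image side would come from the separately converted CLIP v2 vision
#    encoder; the two embeddings are then compared in vector space, matching
#    the official v2 usage pattern.
```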