-
I need to correct a previous statement. It appears that the v2 LoRA was trained separately and is not directly derived from any single LoRA head in jina-embeddings-v3. Therefore, if we want to fully match the official implementation, it seems we should convert the v2 model itself.
-
One of the things that will be quite messy about this model is that it doesn't allow mixing text and vision. This is unlike traditional multimodal LLMs, where a projector "bridges" the text and vision parts. Ultimately, I don't believe that future models will follow this approach. In real-world scenarios, PDFs commonly contain both text and images, and it doesn't make sense to generate embeddings separately for each image or piece of text. The Jina researchers seem to have rushed the release of their model and skipped this use case altogether.
-
Jina CLIP v2 is a multimodal embedding model, not a multimodal generation model.
As stated in the official card:
jina-clip-v2 is a general-purpose multilingual multimodal embedding model for text & images.
The typical usage pattern is to compute text and image embeddings separately, and then compare them in vector space
(e.g., cosine similarity).
https://huggingface.co/jinaai/jina-clip-v2
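To make "compare them in vector space" concrete, here is a minimal sketch of cosine-similarity scoring over precomputed embeddings. It is plain NumPy; the 1024-dim shapes are just illustrative assumptions, not something mandated by the model.

```python
# Minimal sketch: compare precomputed text/image embeddings with cosine
# similarity. Plain NumPy; the 1024-dim shapes are illustrative assumptions.
import numpy as np

def cosine_scores(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one text vector and a batch of image vectors."""
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    return imgs @ t

# Stand-ins for embeddings produced by the text and vision encoders.
text_emb = np.random.rand(1024).astype(np.float32)
image_embs = np.random.rand(4, 1024).astype(np.float32)

scores = cosine_scores(text_emb, image_embs)
print(scores)                # one score per image
print(int(scores.argmax()))  # index of the best-matching image
```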
Jina CLIP v2 is a mixed model that contains both a text model and a vision model.
At inference time, the implementation loads the relevant parts depending on whether you call encode_text or
encode_image.
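For reference, this is roughly what that split looks like from the Hugging Face side. The encode_text / encode_image names come from the model's trust_remote_code implementation; the exact arguments and return types here are assumptions, so check the model card before relying on them.

```python
# Rough sketch of the encode_text / encode_image split on the HF side.
# Exact signatures live in the model's remote code; treat argument handling
# here as an approximation and "cat.jpg" as a placeholder path.
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

# Text path: only the text tower is exercised.
text_embs = model.encode_text(["a photo of a cat", "ein Foto einer Katze"])

# Image path: only the vision tower is exercised.
image_embs = model.encode_image(["cat.jpg"])

# Both outputs live in the same embedding space, so they can be compared
# with cosine similarity as in the sketch above.
```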
https://huggingface.co/jinaai/jina-embeddings-v3
CLIP v2 takes the v3 text model and activates one commonly used LoRA head.
In contrast, jina-embeddings-v3 itself ships with five LoRA heads.
CLIP v2 picks one of these tasks and bakes it into its mixed model.
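For comparison, this is roughly how the v3 LoRA heads are selected at encode time. The task parameter and the five task names below are my reading of the jina-embeddings-v3 model card, so treat them as assumptions and double-check against the card.

```python
# Rough sketch of selecting one of the five v3 LoRA heads at encode time.
# The task names are my reading of the jina-embeddings-v3 model card.
from transformers import AutoModel

v3 = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

# Each task activates a different LoRA adapter on the shared text backbone:
# "retrieval.query", "retrieval.passage", "separation", "classification",
# "text-matching".
query_emb = v3.encode(["how do I build llama.cpp?"], task="retrieval.query")
passage_emb = v3.encode(["llama.cpp builds with CMake."], task="retrieval.passage")

# CLIP v2, by contrast, ships a single text head inside its mixed
# text+vision checkpoint.
```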
———
Current state of this PR
The vision path is not wired into the mtmd CLI, because mtmd is designed around multimodal generation, not pure image embedding.
The text path reuses the existing embedding-only logic.
Images go through the vision encoder, and similarity is computed in embedding space.
This keeps the implementation embedding-only, instead of forcing CLIP v2 through the full mtmd multimodal generation flow.
———
Proposed final usage pattern
After considering the reviewers’ feedback, I think the most reasonable and llama.cpp‑friendly usage pattern is:
For the text side, convert jinaai/jina-embeddings-v3.
For the vision side, convert jina-clip-v2 and treat the result as a vision-only model (even though the source HF checkpoint is mixed text+vision).
Compute text and image embeddings separately and compare them in vector space, matching the official CLIP v2 usage pattern.
This guarantees compatibility with the official v2 model.
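As a rough sketch of the text side of that pattern, assuming a local llama.cpp checkout with convert_hf_to_gguf.py and a built llama-embedding binary, and assuming conversion support for this architecture is in place once the relevant changes land (file names and paths below are placeholders):

```python
# Rough sketch of the text side of the proposed pattern. Assumes a local
# llama.cpp checkout (convert_hf_to_gguf.py plus a built llama-embedding
# binary) and that conversion support for this architecture is available.
# All paths and file names are placeholders.
import subprocess
from huggingface_hub import snapshot_download

# 1. Download the v3 text model and convert it to GGUF.
model_dir = snapshot_download("jinaai/jina-embeddings-v3")
subprocess.run(
    ["python", "convert_hf_to_gguf.py", model_dir,
     "--outfile", "jina-embeddings-v3-f16.gguf", "--outtype", "f16"],
    check=True,
)

# 2. Compute a text embedding with the stock embedding example.
subprocess.run(
    ["./llama-embedding", "-m", "jina-embeddings-v3-f16.gguf",
     "-p", "a photo of a cat"],
    check=True,
)

# 3. The image side would come from the separately converted CLIP v2 vision
#    encoder; the two embeddings are then compared in vector space, matching
#    the official v2 usage pattern.
```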