I have a proven pipeline that works with embedders like snowflake, nomic, bge-m3, gte, and also jina-embeddings-v3.
I only swapped in your v5 embedder (I don't know which one to use: retrieval or cluster? Or is there one for embedding and one for the query? You don't describe it).
Anyway ... I checked cross vs. ... and for me the results are bad! Only non-matching snippets are found.
How should I use the embedder?
Hi,
if you are using this model (which only supports retrieval, not the other tasks), you can use either vLLM, llama.cpp, or sentence-transformers.
sentence-transformers is probably the easiest of them if you use Python. You just need to follow the example in the README, which adds the prompt to the text that you pass via the prompt_name argument (either "query" or "document"). From your message, it is hard to say whether you did something wrong when using the model, whether there is a bug in our code, or whether your use case is too specific / does not suit this model. If you provide some examples of what is matched and what should be matched instead, I could look deeper into it. If your task is more symmetric (queries and documents look similar to each other), it could be worth trying the text matching adapter.
Here is the sentence-transformers example from the README (model card):
```python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-retrieval",
    model_kwargs={"dtype": torch.bfloat16},  # Recommended for GPUs
    config_kwargs={"_attn_implementation": "flash_attention_2"},  # Recommended but optional
)

# Optional: set truncate_dim in encode() to control embedding size
query = "Which planet is known as the Red Planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
]

# Encode query and documents
query_embeddings = model.encode(sentences=query, prompt_name="query")
document_embeddings = model.encode(sentences=documents, prompt_name="document")
print(query_embeddings.shape, document_embeddings.shape)
# (1024,) (4, 1024)

similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.4860, 0.7611, 0.5914, 0.6188]])
```
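If you want shorter vectors, the truncate_dim mentioned in the comment above can be passed to encode(); here is a minimal sketch, where 256 is just an illustrative value (check the model card for the supported sizes):

```python
# Truncate the embedding to a smaller dimension.
# 256 is an illustrative value, not confirmed in this thread.
short_query_embedding = model.encode(sentences=query, prompt_name="query", truncate_dim=256)
print(short_query_embedding.shape)
# (256,)
```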
Hello,
OK, I must ask again: which of the v5 models is for the embedding part, if the "retrieval" one is obviously for the query part? Is it the clustering model? I think I have not yet used prompt_name.
I use llama.cpp (GGUF) and I can add prompt_name="query" or "document", but in my first test the order of the results is very similar when I omit these, at least with many other embedders.
Ah OK, if you use llama.cpp, you should start the server with

```bash
llama-server -hf jinaai/jina-embeddings-v5-text-small-retrieval:F16 --embedding --pooling last -ub 32768
```

and use this one model for both queries and documents, but add the prompt manually, i.e., prepend either "Query: " or "Document: " to the texts you are passing, as in the examples. Even if your texts are not in English, you should still use "Query: " or "Document: ". I think prompt_name does not work with llama.cpp.
For example, with curl a request would look like this:
```bash
curl -X POST "http://127.0.0.1:8080/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Query: A beautiful sunset over the beach",
      "Query: Un beau coucher de soleil sur la plage",
      "Document: 海滩上美丽的日落",
      "Document: 浜辺に沈む美しい夕日",
      "Document: Golden sunlight melts into the horizon, painting waves in warm amber and rose, while the sky whispers goodnight to the quiet, endless sea."
    ]
  }'
```
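If you are calling the server from Python rather than curl, the same request works with plain requests; a minimal sketch against the llama-server endpoint above (llama-server exposes an OpenAI-style /v1/embeddings response with a "data" list):

```python
import requests

# Same request as the curl example, against the local llama-server from above.
response = requests.post(
    "http://127.0.0.1:8080/v1/embeddings",
    json={
        "input": [
            "Query: A beautiful sunset over the beach",
            "Document: Golden sunlight melts into the horizon, painting waves in warm amber and rose.",
        ]
    },
)
items = response.json()["data"]
embeddings = [item["embedding"] for item in items]
print(len(embeddings), len(embeddings[0]))  # number of inputs, embedding dimension
```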
My bad ... all is fine, it was my code ...
--pooling last: is there a 512 token limit?
And one question about speed: v5 is a 0.6B model like v3, but it is 3 times slower in the embedding part, just like qwen3-0.6b; is that because it's based on it?
Are the other embedder models, e.g. the -matching- one, good for comparing two snippets, for example?
I haven't read all of the PDF, but maybe you could copy some important parts into the repo as explanation ;)
4.2.2 Text Matching (STS) Adapter

We designed the text-matching adapter for semantic text similarity (STS) tasks, i.e., tasks where both text inputs are treated symmetrically, unlike asymmetric retrieval. This makes the adapter ideal for use cases like duplicate detection, paraphrase identification, or quantifying the similarity of documents in general.

To achieve better symmetric encoding, this adapter uses only the "Document:" prefix during training and inference.
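Based on that excerpt, a symmetric comparison might look like the sketch below. Note that the model id is my guess following the retrieval model's naming pattern (it is not confirmed in this thread), and both snippets get the same "Document: " prefix, as the excerpt describes:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical model id, inferred from the retrieval model's naming pattern;
# check Jina's model listing for the actual text-matching repository.
model = SentenceTransformer("jinaai/jina-embeddings-v5-text-small-matching")

# Symmetric task: both inputs get the "Document: " prefix, per the paper excerpt.
snippet_a = "Document: The cat sat on the mat."
snippet_b = "Document: A cat was sitting on a mat."
embeddings = model.encode([snippet_a, snippet_b])

similarity = model.similarity(embeddings[:1], embeddings[1:])
print(similarity)  # tensor of shape (1, 1) holding the pair's similarity score
```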
For assignment, classification and clustering, can you provide short code examples ... and what would the output look like?
Example: I have defined 5 clusters ... are all chunks sorted by cluster topic, each with a rating or distance number?
Let's say the result is 1000 chunks; can I then run a deeper search on those?
> --pooling last: is there a 512 token limit?
No, the token limit is usually 32k (when using llama-server I specified -ub 32768)
> And one question about speed: v5 is a 0.6B model like v3, but it is 3 times slower in the embedding part, just like qwen3-0.6b; is that because it's based on it?
I haven't tested it, but I would be a bit surprised if it is 3 times slower. v5-small has more active parameters (~500M vs. 300M for v3) and a few more layers, but if it is 3 times slower, that might be due to some optimizations not being activated, e.g., if you use v3 with flash attention but v5 without it. If you use a larger maximum sequence length (token limit) for v5, that can also significantly slow down embedding generation.
> For assignment, classification and clustering, can you provide short code examples ... and what would the output look like?
> Example: I have defined 5 clusters ... are all chunks sorted by cluster topic, each with a rating or distance number?
> Let's say the result is 1000 chunks; can I then run a deeper search on those?
For classification: the embedding model serves as a feature extractor, i.e., you encode all the instances in your training data into embeddings and train a classification model that predicts their classes from the embeddings. Then you can encode new texts (test data / live data) with the embedding model and apply the classification model to those embeddings to determine the class. A minimal sketch of that workflow with scikit-learn follows below; the texts and labels are made up for illustration, and the model id is a guess following the v5 naming pattern:
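```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical model id: the thread implies separate v5 variants per task,
# so check Jina's model listing for the actual classification repository.
model = SentenceTransformer("jinaai/jina-embeddings-v5-text-small-classification")

# Toy labelled training data, made up for illustration.
train_texts = [
    "the screen flickers after the update",
    "please refund my last order",
    "the app crashes on startup",
    "I was charged twice this month",
]
train_labels = ["bug", "billing", "bug", "billing"]

# Step 1: use the embeddings as features.
X_train = model.encode(train_texts)

# Step 2: train any off-the-shelf classifier on those features.
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Step 3: embed new texts and predict their classes.
X_new = model.encode(["the window freezes after login"])
print(clf.predict(X_new))  # e.g. ['bug']
```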
For clustering: you apply the embedding model to all the documents or text chunks that you want to cluster. Then you apply a clustering algorithm (e.g., k-means) that assigns them to groups, so that your texts are grouped without any supervised data (labels, etc.). A typical application for this is data exploration, e.g., you want to get an overview of what kinds of texts are in your dataset by automatically grouping them by their semantics. For your 5-clusters example, see the k-means sketch below: each chunk gets a cluster id, and the distance to its centroid can serve as the "rating" you mentioned, or to narrow a deeper search to one cluster.
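A minimal k-means sketch with scikit-learn (the clustering model id is again a guess following the v5 naming pattern, and the chunks are toy stand-ins for your 1000 chunks):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical model id, inferred from the retrieval model's naming pattern;
# check Jina's model listing for the actual clustering repository.
model = SentenceTransformer("jinaai/jina-embeddings-v5-text-small-clustering")

# Toy stand-ins for your text chunks (you would pass your 1000 chunks here).
chunks = [
    "How to reset a forgotten password",
    "Password reset link does not arrive",
    "Quarterly revenue grew by 12 percent",
    "Profit margins improved this quarter",
    "The hiking trail closes in winter",
    "Best alpine trails for beginners",
    "Install the driver before connecting the printer",
    "Printer setup fails on Windows 11",
    "Recipe for a quick tomato soup",
    "Slow-cooked vegetable stew recipe",
]
embeddings = model.encode(chunks)

# 5 clusters, as in your example.
kmeans = KMeans(n_clusters=5, random_state=0).fit(embeddings)

labels = kmeans.labels_                   # cluster id per chunk, e.g. [2, 2, 0, 0, ...]
distances = kmeans.transform(embeddings)  # distance of every chunk to each of the 5 centroids

# A per-chunk "rating": distance to its own centroid (smaller = more central).
own_distance = distances[np.arange(len(chunks)), labels]

# Deeper search: restrict follow-up retrieval to the chunks of one cluster.
cluster_0 = [c for c, l in zip(chunks, labels) if l == 0]
print(labels, own_distance.round(3), cluster_0, sep="\n")
```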
Thanks for the explanation!
--pooling last runs! Is it the best option for query and answer?
And about speed: yes, I only changed the model, nothing else ... qwen was always slow, even with LM Studio.
qwen3-0.6b and v5:

```
init: embeddings required but some input tokens were not marked as outputs -> overriding
llama_perf_context_print: load time = 4.21 ms
llama_perf_context_print: prompt eval time = 59.46 ms / 409 tokens ( 0.15 ms per token, 6878.34 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 64.72 ms / 410 tokens
llama_perf_context_print: graphs reused = 0
```
v3:

```
init: embeddings required but some input tokens were not marked as outputs -> overriding
llama_perf_context_print: load time = 1.20 ms
llama_perf_context_print: prompt eval time = 11.73 ms / 368 tokens ( 0.03 ms per token, 31383.25 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 12.78 ms / 369 tokens
llama_perf_context_print: graphs reused = 0
```
> --pooling last runs! Is it the best option for query and answer?
Yes
> And about speed: yes, I only changed the model, nothing else ... qwen was always slow, even with LM Studio.
OK, probably llama.cpp has optimized their v3 implementation more with respect to embedding generation, while qwen-0.6B is rather optimized for text generation.