Qwen3-Reranker-4B – OpenVINO IR (INT8 weight-only, asymmetric)

This is a redistribution. For the model's intended use, instruction format, full evaluation (MTEB-R / CMTEB-R / MMTEB-R / MLDR / MTEB-Code / FollowIR), and citation, please see the upstream card: Qwen/Qwen3-Reranker-4B.

OpenVINO IR conversion of Qwen/Qwen3-Reranker-4B, with weights quantized to asymmetric INT8 via NNCF (weight-only). Intended for the OpenArc reranker engine and for optimum-intel pipelines targeting Intel CPUs / iGPUs / dGPUs / NPUs.
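
As a quick reference for the optimum-intel path, below is a minimal sketch of scoring a single query/document pair with the converted IR. The local directory name is a placeholder, and the prompt string is an abridged paraphrase of the instruction format documented in the upstream card, so treat both as illustrative rather than canonical.

import torch
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_dir = "./Qwen3-Reranker-4B-int8-ov"  # placeholder: local copy of this repo
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")

# The reranker answers "yes"/"no"; relevance is the probability of "yes".
yes_id = tokenizer.convert_tokens_to_ids("yes")
no_id = tokenizer.convert_tokens_to_ids("no")

prompt = (
    "<|im_start|>system\nJudge whether the Document meets the requirements based on "
    'the Query and the Instruct provided. Note that the answer can only be "yes" or '
    '"no".<|im_end|>\n'
    "<|im_start|>user\n<Instruct>: Given a web search query, retrieve relevant passages "
    "that answer the query\n<Query>: what is OpenVINO?\n"
    "<Document>: OpenVINO is a toolkit for optimizing and deploying AI inference on "
    "Intel hardware.<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n</think>\n\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
logits = model(**inputs).logits[0, -1]                 # next-token logits
score = torch.softmax(logits[[no_id, yes_id]], dim=0)[1].item()
print(f"relevance: {score:.4f}")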

Files

  • openvino_model.{xml,bin} – Qwen3 (4B) decoder, INT8 weights (~4.0 GB)
  • openvino_tokenizer.{xml,bin} / openvino_detokenizer.{xml,bin} – OpenVINO Tokenizers IR
  • chat_template.jinja, generation_config.json
  • Standard HF tokenizer files: tokenizer.json, tokenizer_config.json, special_tokens_map.json, vocab.json, merges.txt
  • LICENSE, NOTICE – Apache-2.0 with attribution to the upstream Qwen Team.

Architecture

Base model                   Qwen3ForCausalLM (Qwen3-4B-Base)
Hidden size                  2560
Layers                       36
Attention heads / KV heads   32 / 8
Max position                 40,960
Vocabulary                   151,669
Source dtype                 bfloat16
Quantization                 NNCF INT8 weight-only, asymmetric

Usage with OpenArc

openarc add qwen3-4b-reranker \
  --model-path /path/to/Qwen3-Reranker-4B-int8-ov \
  --model-type rerank \
  --engine optimum \
  --device GPU

openarc serve
# POST /v1/rerank  {"model": "qwen3-4b-reranker", "query": "...", "documents": [...]}
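
Once the server is running, the endpoint can be exercised with any HTTP client. A minimal sketch using requests follows; the host and port are assumptions (whatever openarc serve binds to locally), and only the request body mirrors the shape shown above.

import requests

resp = requests.post(
    "http://localhost:8000/v1/rerank",   # host/port are an assumption
    json={
        "model": "qwen3-4b-reranker",
        "query": "how do I run OpenVINO models on an Intel iGPU?",
        "documents": [
            "OpenVINO provides a GPU plugin for Intel integrated graphics.",
            "The capital of France is Paris.",
        ],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # response schema is defined by the OpenArc version in use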

Conversion notes

The standard CLI route currently fails silently for this model on optimum-intel @ HEAD:

optimum-cli export openvino --weight-format int8 \
  --model Qwen/Qwen3-Reranker-4B ./out
# exits 0; openvino_model.xml is 0 bytes, openvino_model.bin is a ~13 MB stub

The Python API path produces a usable model:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Weight-only INT8, asymmetric (applied by NNCF during export).
quant = OVWeightQuantizationConfig(bits=8, sym=False)
m = OVModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Reranker-4B",
    export=True,                      # convert the HF checkpoint to OpenVINO IR
    quantization_config=quant,
    trust_remote_code=True,
)
m.save_pretrained("./out")
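
Because the CLI failure above is silent, it is worth sanity-checking the export before using it. A small check, assuming the ./out directory from the snippet:

import os
import openvino as ov

xml_path = "./out/openvino_model.xml"
bin_path = "./out/openvino_model.bin"

# A healthy export has a multi-MB XML graph and a multi-GB INT8 weights file;
# the broken CLI path leaves a 0-byte XML and a ~13 MB stub BIN.
assert os.path.getsize(xml_path) > 0, "empty XML graph: export failed silently"
assert os.path.getsize(bin_path) > 1_000_000_000, "weights file looks like a stub"

ov.Core().read_model(xml_path)  # raises if the IR is malformed
print("exported IR looks sane")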

The tokenizer / detokenizer IR files were generated separately via openvino_tokenizers.convert_tokenizer(..., with_detokenizer=True).
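
For completeness, a sketch of that conversion step, assuming the openvino-tokenizers package is installed alongside openvino:

import openvino as ov
from transformers import AutoTokenizer
from openvino_tokenizers import convert_tokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-4B")
# with_detokenizer=True returns a (tokenizer, detokenizer) pair of ov.Model objects
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)
ov.save_model(ov_tokenizer, "./out/openvino_tokenizer.xml")
ov.save_model(ov_detokenizer, "./out/openvino_detokenizer.xml")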

A more detailed walkthrough lives in OpenArc/docs/openvino_qwen3.md.

License

Apache-2.0, inherited from Qwen/Qwen3-Reranker-4B. See LICENSE and NOTICE in this repo.

Citation

From the upstream model card:

@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}