arxiv:2603.19223

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Published on Mar 19

Submitted by Ziyin Zhang on Mar 20

CodeFuse AI

Abstract

AI-generated summary: F2LLM-v2 is a multilingual embedding model family trained on 60 million samples across 200+ languages, achieving superior performance through LLM-based training, matryoshka learning, pruning, and distillation techniques.

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B parameters. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we obtain models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
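The abstract mentions matryoshka learning without describing its practical effect. In general, matryoshka-style training aims to make the leading components of an embedding usable on their own, so a vector can be shortened after the fact. The sketch below illustrates only that downstream truncation step, in plain Python with toy vectors. The function names `truncate_embedding` and `cosine`, the 8-dimensional inputs, and the cut-off of 4 are all hypothetical; none of this is the F2LLM-v2 code or API.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 8-dimensional vectors standing in for real model outputs (assumption).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.02, -0.01, 0.03]
doc = [0.8, 0.2, 0.25, -0.1, -0.04, 0.01, 0.02, -0.02]

full = cosine(query, doc)
short = cosine(truncate_embedding(query, 4), truncate_embedding(doc, 4))
print(f"full: {full:.3f}, truncated to 4 dims: {short:.3f}")
```

With a matryoshka-trained model, the similarity computed on the shortened vectors is expected to stay close to the full-dimension value, which is what allows trading accuracy for storage and search cost. Which truncation sizes F2LLM-v2 actually supports is not stated in this abstract.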

Community

Geralt-Targaryen

Paper submitter 2 days ago

F2LLM-v2 is a family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B parameters. Trained on a curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. F2LLM-v2-14B ranks 1st on 11 MTEB benchmarks: European, Scandinavian, German, French, Polish, Dutch, Indic, Japanese, Persian, Vietnamese, and Code.