Vietnamese_Embedding is an embedding model fine-tuned from the BGE-M3 model (https://huggingface.co/BAAI/bge-m3) to enhance retrieval capabilities for Vietnamese.
from sentence_transformers import SentenceTransformer
import torch
# Load the fine-tuned model from the Hugging Face Hub.
model = SentenceTransformer("AITeamVN/Vietnamese_Embedding")
# Truncate inputs longer than 2048 tokens.
model.max_seq_length = 2048
# Queries: "What is artificial intelligence?", "Benefits of sleep"
sentences_1 = ["Trí tuệ nhân tạo là gì", "Lợi ích của giấc ngủ"]
# Documents: one passage answering each query (AI definition, benefits of sleep)
sentences_2 = ["Trí tuệ nhân tạo là công nghệ giúp máy móc suy nghĩ và học hỏi như con người. Nó hoạt động bằng cách thu thập dữ liệu, nhận diện mẫu và đưa ra quyết định.",
"Giấc ngủ giúp cơ thể và não bộ nghỉ ngơi, hồi phục năng lượng và cải thiện trí nhớ. Ngủ đủ giấc giúp tinh thần tỉnh táo và làm việc hiệu quả hơn."]
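# Encode queries and documents into dense vectors.
query_embedding = model.encode(sentences_1)
doc_embeddings = model.encode(sentences_2)
# Row i, column j holds the score of query i against document j.
similarity = query_embedding @ doc_embeddings.T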
print(similarity)
'''
array([[0.66212064, 0.33066642],
[0.25866613, 0.5865289 ]], dtype=float32)
'''
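The similarity matrix above can be turned into a per-query ranking. The following is a minimal sketch, not part of the original model card: it reuses the printed scores as a fixed NumPy array so it runs without downloading the model, and the variable names `ranking` and `best_doc` are illustrative.

```python
import numpy as np

# Scores copied from the example output above
# (rows: queries in sentences_1, columns: documents in sentences_2).
similarity = np.array(
    [[0.66212064, 0.33066642],
     [0.25866613, 0.5865289]],
    dtype=np.float32,
)

# For each query, sort document indices from highest to lowest score.
ranking = np.argsort(-similarity, axis=1)

# The first column of the ranking is the best-matching document per query.
best_doc = ranking[:, 0]

for q, d in enumerate(best_doc):
    print(f"query {q} -> document {d} (score {similarity[q, d]:.4f})")
```

In this example each query ranks its own matching passage first, which is the behavior a retrieval model is expected to show.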
| Model | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 | MRR@10 |
|---|---|---|---|---|---|
| Vietnamese_Reranker | 0.7944 | 0.9324 | 0.9537 | 0.9740 | 0.8672 |
| Vietnamese_Embedding_v2 | 0.7262 | 0.8927 | 0.9268 | 0.9578 | 0.8149 |
| Vietnamese_Embedding (public) | 0.7274 | 0.8992 | 0.9305 | 0.9568 | 0.8181 |
| Vietnamese-bi-encoder (BKAI) | 0.7109 | 0.8680 | 0.9014 | 0.9299 | 0.7951 |
| BGE-M3 | 0.5682 | 0.7728 | 0.8382 | 0.8921 | 0.6822 |
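For readers unfamiliar with the column headers, the sketch below shows how Accuracy@k and MRR@10 are commonly defined. It is an assumption about the metric definitions, not the authors' evaluation script; the function names and the toy document IDs are hypothetical.

```python
from typing import List, Set


def accuracy_at_k(ranked: List[List[str]], relevant: List[Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for docs, rel in zip(ranked, relevant)
        if any(doc in rel for doc in docs[:k])
    )
    return hits / len(ranked)


def mrr_at_k(ranked: List[List[str]], relevant: List[Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        for rank, doc in enumerate(docs[:k], start=1):
            if doc in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked)


# Toy data (hypothetical IDs): three queries with their ranked results.
ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d1"}, {"d5"}, {"d0"}]

acc1 = accuracy_at_k(ranked, relevant, k=1)   # only the first query hits at rank 1
acc3 = accuracy_at_k(ranked, relevant, k=3)   # first and second queries hit
mrr10 = mrr_at_k(ranked, relevant, k=10)      # (1 + 1/2 + 0) / 3
print(acc1, acc3, mrr10)
```

Under these definitions, Accuracy@k can only grow as k increases, which matches the left-to-right trend in each row of the table.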
Vietnamese_Reranker and Vietnamese_Embedding_v2 were trained on 1,100,000 triplets.
Vietnamese_Embedding_v2 scores slightly lower on the legal domain, but because the training data for this phase is much larger, it performs well on other domains.
Both models are available via these links: Vietnamese_Embedding_v2, Vietnamese_Reranker
You can reproduce the evaluation results by running `python evaluation_model.py` (data downloaded from Kaggle).
Email: nguyennhotrung3004@gmail.com
Developer
Member: Nguyễn Nho Trung, Nguyễn Nhật Quang, Nguyen Van Huy
@misc{Vietnamese_Embedding,
title={Vietnamese_Embedding: Embedding model in Vietnamese language.},
author={Nguyen Nho Trung and Nguyen Nhat Quang and Nguyen Van Huy},
year={2025},
publisher={Huggingface},
}