---
license: apache-2.0
language:
- my
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- myanmar
- burmese
- nlp
library_name: sentence-transformers
dataset_size: 1000000
loss: MSELoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
widget:
- source_sentence: >-
    ▁ထို အလုပ်ရုံ သည် ▁ကျနော် ၏ ▁ကိုယ်ရေး အချက်အလက် များကို ဖတ်ရှု ကာ ▁မေးခွန်းများ ▁မေး ကာ ▁ကျနော့်ကို ▁ဝယ် လိုက်ပါတော့သည်။
  sentences:
  - >-
    ▁ထုံးတမ်းစဉ်လာ ▁လေး ပါး တွင် ▁ကံ ▁၊ ▁တရား ▁၊ ▁သ မ် စာ ▁၊ ▁မော သံ ▁နှင့် ▁ယောဂ ▁အမျိုးအစား ▁အမျိုးမျိုး တို့ ▁ပါဝင် သည်။
  - >-
    ▁ကိုယ်ပိုင် ဟန် ၊ ▁ကိုယ်ပိုင် ဒီဇိုင်း ၊ ▁ကိုယ်ပိုင် စိတ်ကူး ၊ ▁ကိုယ်ပိုင် ဖန်တီး မှုကို ▁ပြသ သည့် ▁ဝတ်စုံ များကို ▁ဒီဇိုင်နာ ▁မ မီး မီး က ▁ပန်းချီကား တစ်ချပ် သဖွယ် ▁ဖန်တီး သူဖြစ်သည်။
datasets:
- DatarrX/myX-Mega-Corpus
---

# 📝 myX-Semantic: A Burmese Sentence Embedding Model

## Model Description

**myX-Semantic** is a sentence-transformer model fine-tuned for the Burmese (Myanmar) language. It maps sentences and paragraphs into a **768-dimensional dense vector space**.

This model is built using a **Knowledge Distillation** approach: a `paraphrase-multilingual-MiniLM-L12-v2` student architecture is trained to mimic the output of a larger teacher model (`paraphrase-multilingual-mpnet-base-v2`). To match the teacher's embedding size, a dedicated Dense layer projects the student's native 384-dimensional output into the final 768-dimensional space.

### Key Applications

* **Semantic Textual Similarity (STS):** Measuring how similar two sentences are in meaning.
* **Semantic Search:** Retrieving relevant documents based on intent rather than keywords.
* **Text Classification & Clustering:** Grouping similar Burmese texts based on their semantic vectors.
* **Information Retrieval:** Finding answers or paraphrases in large Burmese datasets.
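The distillation objective described above can be sketched numerically. The snippet below is a minimal illustration, not the actual training code: it uses random NumPy arrays as stand-ins for the student's pooled embeddings and the teacher's outputs, applies the 384 → 768 Dense projection from the model architecture, and computes the MSE loss the student is trained on.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, student_dim, teacher_dim = 8, 384, 768

# Stand-ins for real embeddings (hypothetical data, not model outputs):
student_pooled = rng.standard_normal((batch, student_dim))  # MiniLM mean pooling
teacher_emb = rng.standard_normal((batch, teacher_dim))     # mpnet teacher output

# The Dense({'in_features': 384, 'out_features': 768}) projection layer
W = rng.standard_normal((student_dim, teacher_dim)) * 0.01
b = np.zeros(teacher_dim)

projected = student_pooled @ W + b              # shape (8, 768)
mse = np.mean((projected - teacher_emb) ** 2)   # the MSELoss training signal
print(projected.shape)  # (8, 768)
```

During training, gradients of this loss update the student transformer and the Dense layer, pulling the projected student vectors toward the frozen teacher's 768-dimensional embeddings.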
## Development & Distribution

* **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
* **Published by:** [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX)
* **Training Dataset:** [DatarrX/myX-Mega-Corpus](https://huggingface.co/datasets/DatarrX/myX-Mega-Corpus) (1 million rows)
* **Tokenization:** Processed using [DatarrX/myX-Tokenizer](https://huggingface.co/DatarrX/myX-Tokenizer).

## Technical Specifications

- **Base Model:** `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
- **Max Sequence Length:** 512 tokens
- **Output Dimension:** 768
- **Similarity Function:** Cosine Similarity
- **Loss Function:** MSELoss (Mean Squared Error)

### Model Architecture

```text
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
  (2): Dense({'in_features': 384, 'out_features': 768, 'bias': True, 'activation_function': 'Identity'})
)
```

## Usage

### Installation

```bash
pip install -U sentence-transformers
```

### Direct Usage (Inference)

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("DatarrX/myX-Semantic")

# Define sentences
sentences = [
    "သူနှင့် ကျွန်မ ခဏ ငြိမ်နေလိုက်၏။",          # "He and I stayed quiet for a moment."
    "ကျွန်တော်တို့ အတူတူ ထိုင်နေကြသည်။",        # "We are sitting together."
    "နည်းပညာသည် လူသားတို့အတွက် အရေးကြီးသည်။",  # "Technology is important for humanity."
]

# Compute embeddings
embeddings = model.encode(sentences)

# Compute pairwise similarity scores
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```

## Implementation Guidelines (Thresholds)

When using this model for similarity detection or semantic search, the choice of similarity threshold is crucial for balancing precision and recall. Based on empirical testing:

* **Recommended Threshold:** A cosine similarity score of **0.60 or higher** is recommended to determine a strong semantic match.
* **Comparison:** Compared to lighter variants (e.g., models trained on 500K rows), this 1M-row model produces more confident vector representations. Lower-capacity models may require a threshold around 0.40, whereas **myX-Semantic** is tuned for a more distinctive separation at the 0.60 level.

## Training Details

* **Samples:** 1,000,000 training pairs
* **Batch Size:** 64
* **Learning Rate:** 3e-5
* **Optimizer:** AdamW
* **Batch Sampling:** `round_robin`
* **Teacher Model:** `paraphrase-multilingual-mpnet-base-v2` (768-dim)

### Training Logs

| Epoch | Step | Training Loss |
| :--- | :--- | :--- |
| 0.06 | 500 | 0.0086 |
| 0.25 | 2000 | 0.0045 |
| 0.64 | 5000 | 0.0031 |
| 0.96 | 7500 | 0.0028 |

## Limitations & Bias

* **Language:** This model is optimized for Unicode-encoded Burmese. It may not perform accurately on Zawgyi-encoded text.
* **Data Bias:** The model reflects the patterns and biases found in the `myX-Mega-Corpus`. Users should validate results for specific sensitive domains.

## License

This model is licensed under the **Apache License 2.0**. You are free to use it for research and commercial purposes, provided appropriate credit is given.

## Citation

If you find this model useful in your project, please cite it:

```bibtex
@software{khantsintheinn2026myxsemantic,
  author    = {Khant Sint Heinn},
  title     = {myX-Semantic: A Burmese Sentence Embedding Model},
  year      = {2026},
  publisher = {DatarrX},
  url       = {https://huggingface.co/DatarrX/myX-Semantic}
}
```

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a machine learning engineer focused on natural language processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications. His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**

[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)
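As a supplement to the threshold guidance in *Implementation Guidelines* above, the sketch below shows how the recommended 0.60 cutoff might be applied in a retrieval loop. It uses tiny 3-dimensional toy vectors in place of real `model.encode()` outputs (which are 768-dimensional); the document names and values are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.60  # recommended cutoff for a strong semantic match

# Toy vectors standing in for model.encode() outputs
query = [0.9, 0.1, 0.2]
candidates = {
    "doc_a": [0.88, 0.15, 0.25],  # nearly parallel to the query
    "doc_b": [-0.5, 0.9, 0.1],    # points elsewhere
}

# Keep only candidates that clear the 0.60 bar
matches = {name: score
           for name, vec in candidates.items()
           if (score := cosine_sim(query, vec)) >= THRESHOLD}
print(sorted(matches))  # ['doc_a']
```

The same filter applies unchanged to real embeddings: encode the query and candidates with this model, score each pair with cosine similarity, and treat scores at or above 0.60 as strong matches.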