---
license: apache-2.0
language:
- my
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- myanmar
- burmese
- nlp
library_name: sentence-transformers
dataset_size: 1000000
loss: MSELoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
widget:
- source_sentence: >-
    ▁ထို အလုပ်ရုံ သည် ▁ကျနော် ၏ ▁ကိုယ်ရေး အချက်အလက် များကို ဖတ်ရှု ကာ ▁မေးခွန်းများ ▁မေး ကာ ▁ကျနော့်ကို ▁ဝယ် လိုက်ပါတော့သည်။
  sentences:
  - >-
    ▁ထုံးတမ်းစဉ်လာ ▁လေး ပါး တွင် ▁ကံ ▁၊ ▁တရား ▁၊ ▁သ မ် စာ ▁၊ ▁မော သံ ▁နှင့် ▁ယောဂ ▁အမျိုးအစား ▁အမျိုးမျိုး တို့ ▁ပါဝင် သည်။
  - >-
    ▁ကိုယ်ပိုင် ဟန် ၊ ▁ကိုယ်ပိုင် ဒီဇိုင်း ၊ ▁ကိုယ်ပိုင် စိတ်ကူး ၊ ▁ကိုယ်ပိုင် ဖန်တီး မှုကို ▁ပြသ သည့် ▁ဝတ်စုံ များကို ▁ဒီဇိုင်နာ ▁မ မီး မီး က ▁ပန်းချီကား တစ်ချပ် သဖွယ် ▁ဖန်တီး သူဖြစ်သည်။
datasets:
- DatarrX/myX-Mega-Corpus
---

# 📝 myX-Semantic: A Burmese Sentence Embedding Model

## Model Description

**myX-Semantic** is a sentence-transformer model fine-tuned for the Burmese (Myanmar) language. It maps sentences and paragraphs into a **768-dimensional dense vector space**.

This model is built using a **Knowledge Distillation** approach: a `paraphrase-multilingual-MiniLM-L12-v2` student architecture is trained to mimic the output of a larger teacher model (`paraphrase-multilingual-mpnet-base-v2`). To match the teacher's embedding size, a dedicated Dense layer projects the student's native 384-dimensional output into the final 768-dimensional space.

### Key Applications

* **Semantic Textual Similarity (STS):** Measuring how similar two sentences are in meaning.
* **Semantic Search:** Retrieving relevant documents based on intent rather than keywords.
* **Text Classification & Clustering:** Grouping similar Burmese texts based on their semantic vectors.
* **Information Retrieval:** Finding answers or paraphrases in large Burmese datasets.
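The distillation objective described above can be sketched numerically. The snippet below is a minimal illustration, not the actual training code: it uses random NumPy arrays as stand-ins for the student's pooled embeddings and the teacher's outputs, applies the 384 → 768 Dense projection from the model architecture, and computes the MSE loss the student is trained on.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, student_dim, teacher_dim = 8, 384, 768

# Stand-ins for real embeddings (hypothetical data, not model outputs):
student_pooled = rng.standard_normal((batch, student_dim))  # MiniLM mean pooling
teacher_emb = rng.standard_normal((batch, teacher_dim))     # mpnet teacher output

# The Dense({'in_features': 384, 'out_features': 768}) projection layer
W = rng.standard_normal((student_dim, teacher_dim)) * 0.01
b = np.zeros(teacher_dim)

projected = student_pooled @ W + b              # shape (8, 768)
mse = np.mean((projected - teacher_emb) ** 2)   # the MSELoss training signal
print(projected.shape)  # (8, 768)
```

During training, gradients of this loss update the student transformer and the Dense layer, pulling the projected student vectors toward the frozen teacher's 768-dimensional embeddings.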
## Development & Distribution

* **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
* **Published by:** [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX)
* **Training Dataset:** [DatarrX/myX-Mega-Corpus](https://huggingface.co/datasets/DatarrX/myX-Mega-Corpus) (1 million rows)
* **Tokenization:** Processed using [DatarrX/myX-Tokenizer](https://huggingface.co/DatarrX/myX-Tokenizer).

## Technical Specifications

- **Base Model:** `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
- **Max Sequence Length:** 512 tokens
- **Output Dimension:** 768
- **Similarity Function:** Cosine Similarity
- **Loss Function:** MSELoss (Mean Squared Error)

### Model Architecture

```text
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
  (2): Dense({'in_features': 384, 'out_features': 768, 'bias': True, 'activation_function': 'Identity'})
)
```

## Usage

### Installation

```bash
pip install -U sentence-transformers
```

### Direct Usage (Inference)

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("DatarrX/myX-Semantic")

# Define sentences
sentences = [
    "သူနှင့် ကျွန်မ ခဏ ငြိမ်နေလိုက်၏။",          # "He and I stayed quiet for a moment."
    "ကျွန်တော်တို့ အတူတူ ထိုင်နေကြသည်။",        # "We are sitting together."
    "နည်းပညာသည် လူသားတို့အတွက် အရေးကြီးသည်။",  # "Technology is important for humanity."
]

# Compute embeddings
embeddings = model.encode(sentences)

# Compute pairwise similarity scores
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```

## Implementation Guidelines (Thresholds)

When using this model for similarity detection or semantic search, the choice of similarity threshold is crucial for balancing precision and recall. Based on empirical testing:

* **Recommended Threshold:** A cosine similarity score of **0.60 or higher** is recommended to determine a strong semantic match.
* **Comparison:** Compared to lighter variants (e.g., models trained on 500K rows), this 1M-row model produces more confident vector representations. Lower-capacity models may require a threshold around 0.40, whereas **myX-Semantic** is tuned for a more distinctive separation at the 0.60 level.

## Training Details

* **Samples:** 1,000,000 training pairs
* **Batch Size:** 64
* **Learning Rate:** 3e-5
* **Optimizer:** AdamW
* **Batch Sampling:** `round_robin`
* **Teacher Model:** `paraphrase-multilingual-mpnet-base-v2` (768-dim)

### Training Logs

| Epoch | Step | Training Loss |
| :--- | :--- | :--- |
| 0.06 | 500 | 0.0086 |
| 0.25 | 2000 | 0.0045 |
| 0.64 | 5000 | 0.0031 |
| 0.96 | 7500 | 0.0028 |

## Limitations & Bias

* **Language:** This model is optimized for Unicode-encoded Burmese. It may not perform accurately on Zawgyi-encoded text.
* **Data Bias:** The model reflects the patterns and biases found in the `myX-Mega-Corpus`. Users should validate results for specific sensitive domains.

## License

This model is licensed under the **Apache License 2.0**. You are free to use it for research and commercial purposes, provided appropriate credit is given.

## Citation

If you find this model useful in your project, please cite it:

```bibtex
@software{khantsintheinn2026myxsemantic,
  author    = {Khant Sint Heinn},
  title     = {myX-Semantic: A Burmese Sentence Embedding Model},
  year      = {2026},
  publisher = {DatarrX},
  url       = {https://huggingface.co/DatarrX/myX-Semantic}
}
```

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a machine learning engineer focused on natural language processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications. His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**

[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)
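As a supplement to the threshold guidance in *Implementation Guidelines* above, the sketch below shows how the recommended 0.60 cutoff might be applied in a retrieval loop. It uses tiny 3-dimensional toy vectors in place of real `model.encode()` outputs (which are 768-dimensional); the document names and values are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.60  # recommended cutoff for a strong semantic match

# Toy vectors standing in for model.encode() outputs
query = [0.9, 0.1, 0.2]
candidates = {
    "doc_a": [0.88, 0.15, 0.25],  # nearly parallel to the query
    "doc_b": [-0.5, 0.9, 0.1],    # points elsewhere
}

# Keep only candidates that clear the 0.60 bar
matches = {name: score
           for name, vec in candidates.items()
           if (score := cosine_sim(query, vec)) >= THRESHOLD}
print(sorted(matches))  # ['doc_a']
```

The same filter applies unchanged to real embeddings: encode the query and candidates with this model, score each pair with cosine similarity, and treat scores at or above 0.60 as strong matches.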