mmNLI: Burmese Text Natural Language Inference with XLM-RoBERTa
A Burmese NLI model fine-tuned from xlm-roberta-base, trained on a carefully cleaned and curated dataset, achieving 79% accuracy on the test set.
This model predicts the relationship between a premise and a hypothesis as one of:
- Entailment
- Neutral
- Contradiction
Model & Demo
- Model >> https://huggingface.co/emilyyy04/xlm-roberta-base-burmese-nli-v3
- Space Demo Link >> https://huggingface.co/spaces/emilyyy04/burmese-text-nli-xlmroberta
Model Details
- Base model:
xlm-roberta-base - Language: Burmese (Myanmar)
- Task: Natural Language Inference (NLI)
- Labels:
entailment,neutral,contradiction - Framework: Transformers / PyTorch
Dataset and its Structure
The dataset consists of ~10k [10,443] samples across three classes:
| Label | Class | Count |
|---|---|---|
| 0 | Entailment | 3,608 |
| 1 | Neutral | 3,466 |
| 2 | Contradiction | 3,369 |
and the dataset is prepared from:
- By clearning Burmese NLI data (source: [(https://huggingface.co/datasets/akhtet/myanmar-xnli)]) and additional manually created samples.
- Translated English NLI data for diversity.
- Most samples follow a 1 premise → 3 hypotheses structure. Each hypothesis has a different NLI label.
- An
genrefield is included intended for future zero-shot / cross-genre experiments later.
Preprocessing
Steps:
- Unicode normalization (NFC)
- Zawgyi detection
- Automatic conversion to Unicode if Zawgyi text is detected
- Rely on XLM-R subword tokenizer for tokenization
Data Splitting Strategy
To prevent data leakage caused by shared premises:
- Train: 70%
- Validation: 15%
- Test: 15%
Instead of random shuffling:
GroupShuffleSplit is used
Samples with the same premise always stay in the same split
Prevents:
- Premise overlap across splits
- Hypothesis leakage between train / validation / test sets
Training Setup
The model was trained using Hugging Face Transformers with the following key configurations:
- Model: facebook/xlm-roberta-base
- Epochs: Up to 10 epochs
- Early Stopping: Enabled (patience = 1)
- Learning Rate:
1.5e-5 - Batch Size:
16(train & evaluation) - Weight Decay:
0.02 - Warmup Ratio:
0.1 - FP16 Training: Enabled
- Best Model Selection: Based on F1-score
- Seed:
42
The best checkpoint was automatically loaded at the end of training using early stopping and F1-based model selection.
Evaluation Metrics
The model performance is evaluated using:
- Accuracy
- Macro F1-score
Training Results
| Epoch | Train Loss | Val Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.7579 | 0.7919 | 0.6415 | 0.6134 |
| 2 | 0.6764 | 0.6207 | 0.7291 | 0.7265 |
| 3 | 0.5801 | 0.6861 | 0.7443 | 0.7462 |
| 4 | 0.4554 | 0.6415 | 0.7481 | 0.7488 |
| 5 | 0.3848 | 0.6434 | 0.7646 | 0.7646 |
| 6 | 0.3564 | 0.7296 | 0.7608 | 0.7607 |
Training stopped early at epoch 6 due to validation performance plateau and the best model saved is at the f1 score of 0.7646.
Test Set Performance
{
"eval_loss": 0.5780,
"eval_accuracy": 0.7877,
"eval_f1": 0.7876
}
Confusion Matrix on Test Set
Label order: entailment, neutral, contradiction
[[481 50 18]
[ 78 418 20]
[ 14 20 476]]
Rows represent true labels, columns represent predicted labels.
Classification Report (Test Set)
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Entailment | 0.84 | 0.88 | 0.86 | 549 |
| Neutral | 0.86 | 0.81 | 0.83 | 516 |
| Contradiction | 0.93 | 0.93 | 0.93 | 510 |
| Accuracy | 0.87 | 1575 | ||
| Macro Avg | 0.87 | 0.87 | 0.87 | 1575 |
| Weighted Avg | 0.87 | 0.87 | 0.87 | 1575 |
Inference Example
You can use the model as follows:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "emilyyy04/xlm-roberta-base-burmese-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
premise = "ဟုတ်တယ်ဗျခင်ဗျားပြောတာမှန်တယ်အခုခေတ်မှာလူငယ်တွေကနည်းပညာကိုအရင်ကထက်ပိုပြီးကျွမ်းကျင်လာကြတော့အလုပ်အကိုင်အခွင့်အလမ်းသစ်တွေအများကြီးပေါ်လာတာပေါ့"
hypothesis = "နည်းပညာတိုးတက်လာတာကလူငယ်တွေအတွက်အခွင့်အလမ်းကောင်းတွေဖြစ်စေတယ်"
inputs = tokenizer(
premise,
hypothesis,
return_tensors="pt",
truncation=True,
padding=True,
max_length=128
)
outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()
label_map = {0: "entailment", 1: "neutral", 2: "contradiction"}
print("Predicted label:", label_map[predicted_class])
# conf
probs = torch.softmax(outputs.logits, dim=-1)[0]
print("Confidence:", {k: round(float(probs[i]), 3) for i, k in label_map.items()})
Our Team
This project is part of our Semester-8 CS-502 Natural Language Processing course.
| Name | Role | GitHub |
|---|---|---|
| Yoon Thiri Aung | Leader | https://github.com/yoon-thiri04 |
| Soe Sett Lynn | Member | https://github.com/ssettlynn |
| Thura Aung | Member | - |
Limitations & Future Work
Genre-aware and zero-shot classification is planned but not yet implemented
Performance may vary for:
- Very long inputs
- Out-of-domain or highly informal Burmese
Future improvements:
- Larger native Burmese NLI dataset
- Explicit genre-based evaluation
- Domain adaptation
- Downloads last month
- 46