mmNLI: Burmese Text Natural Language Inference with XLM-RoBERTa

A Burmese NLI model fine-tuned from xlm-roberta-base, trained on a carefully cleaned and curated dataset, achieving 79% accuracy on the test set.

This model predicts the relationship between a premise and a hypothesis as one of:

Entailment
Neutral
Contradiction

Model & Demo

Model >> https://huggingface.co/emilyyy04/xlm-roberta-base-burmese-nli-v3
Space Demo Link >> https://huggingface.co/spaces/emilyyy04/burmese-text-nli-xlmroberta

Model Details

Base model: xlm-roberta-base
Language: Burmese (Myanmar)
Task: Natural Language Inference (NLI)
Labels: entailment, neutral, contradiction
Framework: Transformers / PyTorch

Dataset and its Structure

The dataset consists of 10,443 samples (~10k) across three classes:

Label	Class	Count
0	Entailment	3,608
1	Neutral	3,466
2	Contradiction	3,369

The dataset was prepared from:

Cleaned Burmese NLI data (source: https://huggingface.co/datasets/akhtet/myanmar-xnli) plus additional manually created samples.
Translated English NLI data for diversity.
Most samples follow a 1 premise → 3 hypotheses structure, where each hypothesis has a different NLI label.
A genre field is included for future zero-shot / cross-genre experiments.
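To make the 1 premise → 3 hypotheses structure concrete, here is a minimal sketch of a check that every premise carries all three labels. The field names (`premise`, `hypothesis`, `label`, `genre`), the placeholder text, and the helper `premises_missing_labels` are hypothetical; the actual dataset schema is not shown here.

```python
from collections import defaultdict

# Hypothetical records: field names and placeholder text are assumptions
# made for illustration, not the real dataset schema.
records = [
    {"premise": "premise A", "hypothesis": "hyp A1", "label": 0, "genre": "example"},
    {"premise": "premise A", "hypothesis": "hyp A2", "label": 1, "genre": "example"},
    {"premise": "premise A", "hypothesis": "hyp A3", "label": 2, "genre": "example"},
    {"premise": "premise B", "hypothesis": "hyp B1", "label": 0, "genre": "example"},
    {"premise": "premise B", "hypothesis": "hyp B2", "label": 2, "genre": "example"},
]


def premises_missing_labels(rows, expected=frozenset({0, 1, 2})):
    """Return premises whose hypotheses do not cover all expected labels."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["premise"]].add(row["label"])
    return sorted(p for p, labels in seen.items() if labels != expected)


print(premises_missing_labels(records))
```

Since only most samples follow the pattern, a check like this would flag the exceptions rather than reject them.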

Preprocessing

Steps:

Unicode normalization (NFC)
Zawgyi detection
Automatic conversion to Unicode if Zawgyi text is detected
Tokenization relies on the XLM-R subword tokenizer
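The steps above could be sketched as follows. The detector and converter are not named in this card, so they are passed in as hypothetical callables (`zawgyi_probability`, `zawgyi_to_unicode`), and the `threshold` of 0.95 is likewise an assumed value. Only the NFC step uses a concrete call, from the Python standard library.

```python
import unicodedata


def preprocess(text, zawgyi_probability=None, zawgyi_to_unicode=None, threshold=0.95):
    """Normalize to NFC, then convert to Unicode if the text looks like Zawgyi.

    zawgyi_probability and zawgyi_to_unicode are hypothetical hooks: plug in
    whichever detector and converter the pipeline actually uses.
    """
    # Step 1: Unicode normalization (NFC).
    text = unicodedata.normalize("NFC", text)
    # Steps 2 and 3: detect Zawgyi and convert only when detection fires.
    if zawgyi_probability is not None and zawgyi_to_unicode is not None:
        if zawgyi_probability(text) >= threshold:
            text = zawgyi_to_unicode(text)
    return text
```

The cleaned string is then handed to the tokenizer unchanged; no word segmentation is done beforehand.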

Data Splitting Strategy

The data is split by premise group:

Train: 70%
Validation: 15%
Test: 15%

To prevent data leakage caused by shared premises, grouped splitting replaces random shuffling:

GroupShuffleSplit is used, with the premise as the group key
Samples with the same premise always stay in the same split
This prevents:
- Premise overlap across splits
- Hypothesis leakage between train / validation / test sets
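A grouped 70/15/15 split can be sketched with scikit-learn's `GroupShuffleSplit` as below. The two-stage approach (70/30 first, then the 30% halved) and the helper name `grouped_split` are assumptions; the card does not show the original splitting code. Note that `test_size` applies to the number of groups, so the sample-level ratios are only approximate.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def grouped_split(premises, seed=42):
    """Return (train, val, test) index arrays with no premise shared across splits."""
    groups = np.asarray(premises)
    indices = np.arange(len(groups))

    # Stage 1: hold out roughly 30% of the premise groups.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=seed)
    train_idx, held_idx = next(outer.split(indices, groups=groups))

    # Stage 2: divide the held-out groups evenly into validation and test.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.50, random_state=seed)
    val_rel, test_rel = next(inner.split(held_idx, groups=groups[held_idx]))
    return train_idx, held_idx[val_rel], held_idx[test_rel]


# Toy data: 20 premises, each paired with 3 hypotheses.
premises = [f"premise-{i}" for i in range(20) for _ in range(3)]
train_idx, val_idx, test_idx = grouped_split(premises)
```

Because whole groups move together, all hypotheses of a premise land in exactly one split.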

Training Setup

The model was trained using Hugging Face Transformers with the following key configurations:

Model: xlm-roberta-base
Epochs: Up to 10 epochs
Early Stopping: Enabled (patience = 1)
Learning Rate: 1.5e-5
Batch Size: 16 (train & evaluation)
Weight Decay: 0.02
Warmup Ratio: 0.1
FP16 Training: Enabled
Best Model Selection: Based on F1-score
Seed: 42

The best checkpoint was automatically loaded at the end of training using early stopping and F1-based model selection.
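For reference, the settings above could be collected as keyword arguments like this. It is a sketch only: the training script is not shown here, so the key names assume the usual Hugging Face `TrainingArguments` fields, and evaluation/save strategy arguments are left out because their names differ between library versions.

```python
# Hyperparameters from the list above. The key names assume the standard
# Hugging Face TrainingArguments fields; the original script is not shown.
training_kwargs = {
    "num_train_epochs": 10,             # upper bound; early stopping may end sooner
    "learning_rate": 1.5e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "weight_decay": 0.02,
    "warmup_ratio": 0.1,
    "fp16": True,
    "load_best_model_at_end": True,     # reload the best checkpoint after training
    "metric_for_best_model": "f1",
    "greater_is_better": True,
    "seed": 42,
}

# Passed separately to the early-stopping callback, not to TrainingArguments.
early_stopping_patience = 1
```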

Evaluation Metrics

The model performance is evaluated using:

Accuracy
Macro F1-score

Training Results

Epoch	Train Loss	Val Loss	Val Accuracy	Val F1
1	0.7579	0.7919	0.6415	0.6134
2	0.6764	0.6207	0.7291	0.7265
3	0.5801	0.6861	0.7443	0.7462
4	0.4554	0.6415	0.7481	0.7488
5	0.3848	0.6434	0.7646	0.7646
6	0.3564	0.7296	0.7608	0.7607

Training stopped early after epoch 6 because validation F1 did not improve over epoch 5. The saved best checkpoint is from epoch 5, with a validation F1 of 0.7646.
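The stopping point can be replayed from the F1 column of the table with a small simulation of patience-based early stopping. The function `simulate_early_stopping` is a hypothetical illustration of the rule, not the callback used in training.

```python
def simulate_early_stopping(scores, patience=1):
    """Return (best_epoch, last_epoch_run) for a higher-is-better metric."""
    best_score = float("-inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # stop after `patience` epochs without gain
    return best_epoch, len(scores)


# Validation F1 per epoch, copied from the table above.
val_f1 = [0.6134, 0.7265, 0.7462, 0.7488, 0.7646, 0.7607]
best_epoch, last_epoch = simulate_early_stopping(val_f1, patience=1)
print(best_epoch, last_epoch)  # 5 6
```

With patience = 1, a single epoch without improvement is enough to stop, which is why training ends right after the first drop.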

Test Set Performance

{
  "eval_loss": 0.5780,
  "eval_accuracy": 0.7877,
  "eval_f1": 0.7876
}

Confusion Matrix on Test Set

Label order: entailment, neutral, contradiction

[[481  50  18]
 [ 78 418  20]
 [ 14  20 476]]

Rows represent true labels, columns represent predicted labels.

Classification Report (Test Set)

Class	Precision	Recall	F1-score	Support
Entailment	0.84	0.88	0.86	549
Neutral	0.86	0.81	0.83	516
Contradiction	0.93	0.93	0.93	510
Accuracy			0.87	1575
Macro Avg	0.87	0.87	0.87	1575
Weighted Avg	0.87	0.87	0.87	1575
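The per-class figures in this report can be recomputed from the confusion matrix. The sketch below does so in plain Python; `report_from_confusion` is a hypothetical helper, not part of the original evaluation code.

```python
labels = ["entailment", "neutral", "contradiction"]
# Rows are true labels, columns are predicted labels.
cm = [
    [481, 50, 18],
    [78, 418, 20],
    [14, 20, 476],
]


def report_from_confusion(matrix):
    """Return (accuracy, macro_f1, [(precision, recall, f1), ...])."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    per_class = []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    macro_f1 = sum(f for _, _, f in per_class) / n
    return correct / total, macro_f1, per_class


accuracy, macro_f1, per_class = report_from_confusion(cm)
for name, (p, r, f) in zip(labels, per_class):
    print(f"{name:14s} P={p:.2f} R={r:.2f} F1={f:.2f}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
```

The matrix gives an accuracy of about 0.873 (1,375 of 1,575 correct), which agrees with this report but not with the eval_accuracy of 0.7877 listed under Test Set Performance. The card does not say which evaluation run each figure comes from.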

Inference Example

You can use the model as follows:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "emilyyy04/xlm-roberta-base-burmese-nli-v3"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "ဟုတ်တယ်ဗျခင်ဗျားပြောတာမှန်တယ်အခုခေတ်မှာလူငယ်တွေကနည်းပညာကိုအရင်ကထက်ပိုပြီးကျွမ်းကျင်လာကြတော့အလုပ်အကိုင်အခွင့်အလမ်းသစ်တွေအများကြီးပေါ်လာတာပေါ့"
hypothesis = "နည်းပညာတိုးတက်လာတာကလူငယ်တွေအတွက်အခွင့်အလမ်းကောင်းတွေဖြစ်စေတယ်"

inputs = tokenizer(
    premise,
    hypothesis,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()

label_map = {0: "entailment", 1: "neutral", 2: "contradiction"}
print("Predicted label:", label_map[predicted_class])
# Per-class confidence (softmax probabilities)
probs = torch.softmax(outputs.logits, dim=-1)[0]
print("Confidence:", {k: round(float(probs[i]), 3) for i, k in label_map.items()})

Our Team

This project is part of our Semester-8 CS-502 Natural Language Processing course.

Name	Role	GitHub
Yoon Thiri Aung	Leader	https://github.com/yoon-thiri04
Soe Sett Lynn	Member	https://github.com/ssettlynn
Thura Aung	Member	-

Limitations & Future Work

Genre-aware and zero-shot classification is planned but not yet implemented
Performance may vary for:
- Very long inputs
- Out-of-domain or highly informal Burmese
Future improvements:
- Larger native Burmese NLI dataset
- Explicit genre-based evaluation
- Domain adaptation
