Yoruba Punctuation Restoration Model
A BERT-based token classification model for punctuation restoration in Yoruba text. Given raw unpunctuated Yoruba text, this model predicts punctuation marks at the token level — a key preprocessing step for downstream NLP tasks on this low-resource West African language.
📄 Dataset: abnuel/yor_punctuation (1M–10M tokens)
Model Description
Yoruba is a tonal language spoken by approximately 40–50 million people, primarily in southwestern Nigeria. Despite its large speaker base, it remains significantly underrepresented in NLP tooling. Unpunctuated Yoruba text is common in real-world sources (social media, transcribed speech, scanned documents), creating a major barrier for parsing, translation, and other NLP tasks.
This model addresses that gap by restoring punctuation as a sequence labeling task, fine-tuned on a large Yoruba text corpus.
- Base model: Davlan/bert-base-multilingual-cased-finetuned-yoruba
- Task: Token classification (punctuation restoration)
- Language: Yoruba (yo)
- Parameters: 177.3M
- Architecture: BERT
Labels
The model predicts one of the following token-level labels:
| Label | Description |
|---|---|
| O | No punctuation |
| COMMA | Comma (,) |
| PERIOD | Full stop (.) |
| QUESTION | Question mark (?) |
| EXCLAMATION | Exclamation mark (!) |
(Exact label set may vary — check config.json for the full id2label mapping.)
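To make the labeling scheme concrete, here is a minimal sketch of how these labels map back onto text. The LABEL_TO_MARK mapping and the attach_punctuation helper are illustrative assumptions, not part of the model; verify the label names against id2label in config.json before relying on them.

```python
# Hypothetical mapping from model labels to punctuation marks;
# check config.json's id2label for the authoritative label set.
LABEL_TO_MARK = {
    "O": "",
    "COMMA": ",",
    "PERIOD": ".",
    "QUESTION": "?",
    "EXCLAMATION": "!",
}

def attach_punctuation(words, labels):
    """Append the predicted mark (if any) to each word."""
    return " ".join(
        word + LABEL_TO_MARK.get(label, "")
        for word, label in zip(words, labels)
    )

print(attach_punctuation(["Báwo", "ni"], ["O", "QUESTION"]))
# → Báwo ni?
```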
How to Use
```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    pipeline,
)

model_id = "abnuel/yoruba_task1_punctuation_model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Token-classification pipeline; aggregation_strategy="simple" merges
# subword pieces so each prediction covers a whole word.
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "Mo lọ sí ọjà lánàánì mo ra ẹran àti ẹyin"
result = nlp(text)
print(result)
```
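The pipeline returns label predictions, not punctuated text, so a small post-processing step is needed. The restore function below is a hedged sketch: it assumes word-level predictions (aggregation_strategy="simple") with entity_group and end fields, and the label names from the table above. The fake predictions stand in for real pipeline output purely for illustration.

```python
# Assumed label-to-mark mapping; confirm against config.json's id2label.
MARKS = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

def restore(text, predictions):
    """Re-insert predicted punctuation into the original text.

    Each prediction is expected to carry "entity_group" (the label)
    and "end" (character offset where the word ends in `text`).
    """
    pieces, cursor = [], 0
    for pred in predictions:
        # Copy the original span, then append the predicted mark (if any).
        pieces.append(text[cursor:pred["end"]])
        pieces.append(MARKS.get(pred["entity_group"], ""))
        cursor = pred["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)

# Mocked predictions whose shape matches aggregated pipeline output:
fake = [
    {"entity_group": "O", "end": 2},       # "Mo"
    {"entity_group": "O", "end": 5},       # "lọ"
    {"entity_group": "PERIOD", "end": 10}, # "sílé"
]
print(restore("Mo lọ sílé", fake))
# → Mo lọ sílé.
```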
Training Details
- Fine-tuning approach: Token classification head on top of Yoruba-adapted multilingual BERT
- Dataset: abnuel/yor_punctuation
- Dataset size: 1M–10M tokens
- Reference: Fine-tuning approach informed by Alam et al., "Punctuation Restoration using Transformer Models for High- and Low-Resource Languages" (W-NUT 2020)
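For readers unfamiliar with framing punctuation restoration as sequence labeling, the following sketch shows one common way a punctuated corpus is converted into (token, label) training pairs. This is an illustrative assumption about the preprocessing, not the dataset's documented pipeline, and it only handles sentence-final marks attached to the preceding word.

```python
# Assumed mark-to-label mapping mirroring the label table above.
MARK_TO_LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def make_example(sentence):
    """Split a punctuated sentence into parallel token and label lists."""
    tokens, labels = [], []
    for word in sentence.split():
        label = "O"
        if word and word[-1] in MARK_TO_LABEL:
            # Strip the trailing mark and record it as the word's label.
            label = MARK_TO_LABEL[word[-1]]
            word = word[:-1]
        if word:
            tokens.append(word)
            labels.append(label)
    return tokens, labels

print(make_example("Mo ra ẹran, ẹyin àti ìrẹsì."))
```

At inference time the model sees only the token list and must recover the label list, which is exactly the token-classification objective described above.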
Limitations
- Performance may be lower on highly dialectal Yoruba or heavily code-switched text.
- Tonal diacritics must be present for best results; the model was not specifically trained for diacritic restoration.
- Not evaluated on formal literary Yoruba vs. informal/social media registers.
Related Models & Resources
- abnuel/yoruba_sent_boundary — Sentence boundary detection for Yoruba
- abnuel/yoruba_sent_boundary_2 — Improved sentence boundary model
- abnuel/yoruba_punctuation_2 — Iterative punctuation model update
- abnuel/yor_punctuation — Training dataset
Citation
```bibtex
@misc{adegunlehin2025yoruba-punct,
  author = {Abayomi Adegunlehin},
  title  = {Yoruba Punctuation Restoration Model},
  year   = {2025},
  url    = {https://huggingface.co/abnuel/yoruba_task1_punctuation_model}
}
```