Yoruba Punctuation Restoration Model

A BERT-based token classification model for punctuation restoration in Yoruba text. Given raw unpunctuated Yoruba text, this model predicts punctuation marks at the token level — a key preprocessing step for downstream NLP tasks on this low-resource West African language.

📄 Dataset: abnuel/yor_punctuation (1M–10M tokens)

Model Description

Yoruba is a tonal language spoken by approximately 40–50 million people, primarily in southwestern Nigeria. Despite its large speaker base, it remains significantly underrepresented in NLP tooling. Unpunctuated Yoruba text is common in real-world sources (social media, transcribed speech, scanned documents), creating a major barrier for parsing, translation, and other NLP tasks.

This model addresses that gap by restoring punctuation as a sequence labeling task, fine-tuned on a large Yoruba text corpus.

  • Base model: Davlan/bert-base-multilingual-cased-finetuned-yoruba
  • Task: Token classification (punctuation restoration)
  • Language: Yoruba (yo)
  • Parameters: 177.3M
  • Architecture: BERT

Labels

The model predicts one of the following token-level labels:

Label        Description
O            No punctuation
COMMA        Comma (,)
PERIOD       Full stop (.)
QUESTION     Question mark (?)
EXCLAMATION  Exclamation mark (!)

(Exact label set may vary — check config.json for the full id2label mapping.)
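Once the id2label mapping is known, turning predicted label ids into punctuation marks is a small lookup. The sketch below assumes the five labels from the table above with ids 0–4; the ids shown are hypothetical, and the authoritative mapping is the id2label field in the model's config.json.

```python
# Hypothetical id-to-label mapping mirroring the table above.
# The real mapping lives in config.json (config.id2label) and may differ.
ID2LABEL = {0: "O", 1: "COMMA", 2: "PERIOD", 3: "QUESTION", 4: "EXCLAMATION"}

# Punctuation mark to append after a word for each non-O label.
LABEL2MARK = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

def mark_for(label_id: int) -> str:
    """Return the punctuation mark implied by a predicted label id ('' for O)."""
    return LABEL2MARK.get(ID2LABEL[label_id], "")

print(mark_for(2))  # "."
print(mark_for(0))  # ""
```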

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "abnuel/yoruba_task1_punctuation_model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" merges subword pieces into word-level predictions
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Unpunctuated input: "I went to the market yesterday I bought meat and eggs"
text = "Mo lọ sí ọjà lánàá mo ra ẹran àti ẹyin"
result = nlp(text)
print(result)
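The pipeline returns one prediction per word, not punctuated text, so a small post-processing step is needed to rebuild the sentence. The sketch below assumes word-level output dicts with "word" and "entity_group" keys (the shape produced when an aggregation strategy is set); inspect your pipeline's actual output before relying on these key names.

```python
def restore(tokens_with_labels):
    """Rebuild punctuated text from word-level pipeline predictions.

    `tokens_with_labels` mimics aggregated pipeline output: one dict per
    word with 'word' and 'entity_group' keys (an assumption -- check the
    keys your pipeline actually emits).
    """
    marks = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}
    pieces = []
    for item in tokens_with_labels:
        # Append the predicted mark (if any) directly after the word.
        pieces.append(item["word"] + marks.get(item["entity_group"], ""))
    return " ".join(pieces)

# Hand-written example predictions, not real model output.
example = [
    {"word": "Mo", "entity_group": "O"},
    {"word": "ra", "entity_group": "O"},
    {"word": "ẹran", "entity_group": "COMMA"},
    {"word": "àti", "entity_group": "O"},
    {"word": "ẹyin", "entity_group": "PERIOD"},
]
print(restore(example))  # Mo ra ẹran, àti ẹyin.
```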

Training Details

  • Fine-tuning approach: Token classification head on top of Yoruba-adapted multilingual BERT
  • Dataset: abnuel/yor_punctuation
  • Dataset size: 1M–10M tokens
  • Reference: Fine-tuning approach informed by Punctuation Restoration using Transformer Models

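For training, punctuated corpus text has to be converted into (word, label) pairs, with each word labeled by the mark that immediately follows it. The sketch below shows one common scheme for this conversion; the dataset's actual preprocessing may differ (e.g., in how it handles quotes or multiple trailing marks).

```python
PUNCT2LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def to_token_labels(punctuated: str):
    """Split punctuated text into (word, label) pairs for sequence labeling.

    Each word gets the label of the punctuation mark attached to its end,
    or 'O' if none. This is one common scheme, not necessarily the exact
    preprocessing used for the abnuel/yor_punctuation dataset.
    """
    pairs = []
    for raw in punctuated.split():
        if raw and raw[-1] in PUNCT2LABEL:
            pairs.append((raw[:-1], PUNCT2LABEL[raw[-1]]))
        else:
            pairs.append((raw, "O"))
    return pairs

print(to_token_labels("Rárá, o dáa."))
```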
Limitations

  • Performance may be lower on highly dialectal Yoruba or heavily code-switched text.
  • Tonal diacritics must be present for best results; the model was not specifically trained for diacritic restoration.
  • Not evaluated on formal literary Yoruba vs. informal/social media registers.
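Since missing tonal diacritics degrade results, it can be worth checking whether input text carries them before running the model. A minimal heuristic using the standard library: after NFD normalization, diacritics decompose into combining marks, which unicodedata.combining detects.

```python
import unicodedata

def has_diacritics(text: str) -> bool:
    """Heuristic: True if the text contains any combining marks after NFD
    normalization (Yoruba tonal accents decompose into combining characters).
    Note: dot-below vowels like ọ/ẹ also decompose to combining marks, so
    this flags sub-dots as well as tone marks."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", text))

print(has_diacritics("Mo lọ sí ọjà"))  # True  -- accents and sub-dots present
print(has_diacritics("Mo lo si oja"))  # False -- stripped, plain ASCII text
```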

Citation

@misc{adegunlehin2025yoruba-punct,
  author = {Abayomi Adegunlehin},
  title  = {Yoruba Punctuation Restoration Model},
  year   = {2025},
  url    = {https://huggingface.co/abnuel/yoruba_task1_punctuation_model}
}