Yoruba Punctuation Restoration Model

A BERT-based token classification model for punctuation restoration in Yoruba text. Given raw unpunctuated Yoruba text, this model predicts punctuation marks at the token level — a key preprocessing step for downstream NLP tasks on this low-resource West African language.

📄 Dataset: abnuel/yor_punctuation (1M–10M tokens)

Model Description

Yoruba is a tonal language spoken by approximately 40–50 million people, primarily in southwestern Nigeria. Despite its large speaker base, it remains significantly underrepresented in NLP tooling. Unpunctuated Yoruba text is common in real-world sources (social media, transcribed speech, scanned documents), creating a major barrier for parsing, translation, and other NLP tasks.

This model addresses that gap by restoring punctuation as a sequence labeling task, fine-tuned on a large Yoruba text corpus.

  • Base model: Davlan/bert-base-multilingual-cased-finetuned-yoruba
  • Task: Token classification (punctuation restoration)
  • Language: Yoruba (yo)
  • Parameters: 177.3M
  • Architecture: BERT

Labels

The model predicts one of the following token-level labels:

Label        Description
O            No punctuation
COMMA        Comma (,)
PERIOD       Full stop (.)
QUESTION     Question mark (?)
EXCLAMATION  Exclamation mark (!)

(Exact label set may vary — check config.json for the full id2label mapping.)
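Once the id2label mapping is known, turning predicted label ids into punctuation marks is a small lookup. The sketch below assumes the five labels from the table above with ids 0–4; the ids shown are hypothetical, and the authoritative mapping is the id2label field in the model's config.json.

```python
# Hypothetical id-to-label mapping mirroring the table above.
# The real mapping lives in config.json (config.id2label) and may differ.
ID2LABEL = {0: "O", 1: "COMMA", 2: "PERIOD", 3: "QUESTION", 4: "EXCLAMATION"}

# Punctuation mark to append after a word for each non-O label.
LABEL2MARK = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

def mark_for(label_id: int) -> str:
    """Return the punctuation mark implied by a predicted label id ('' for O)."""
    return LABEL2MARK.get(ID2LABEL[label_id], "")

print(mark_for(2))  # "."
print(mark_for(0))  # ""
```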

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "abnuel/yoruba_task1_punctuation_model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" merges subword pieces into word-level predictions
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Unpunctuated input: "I went to the market yesterday I bought meat and eggs"
text = "Mo lọ sí ọjà lánàá mo ra ẹran àti ẹyin"
result = nlp(text)
print(result)
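The pipeline returns one prediction per word, not punctuated text, so a small post-processing step is needed to rebuild the sentence. The sketch below assumes word-level output dicts with "word" and "entity_group" keys (the shape produced when an aggregation strategy is set); inspect your pipeline's actual output before relying on these key names.

```python
def restore(tokens_with_labels):
    """Rebuild punctuated text from word-level pipeline predictions.

    `tokens_with_labels` mimics aggregated pipeline output: one dict per
    word with 'word' and 'entity_group' keys (an assumption -- check the
    keys your pipeline actually emits).
    """
    marks = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}
    pieces = []
    for item in tokens_with_labels:
        # Append the predicted mark (if any) directly after the word.
        pieces.append(item["word"] + marks.get(item["entity_group"], ""))
    return " ".join(pieces)

# Hand-written example predictions, not real model output.
example = [
    {"word": "Mo", "entity_group": "O"},
    {"word": "ra", "entity_group": "O"},
    {"word": "ẹran", "entity_group": "COMMA"},
    {"word": "àti", "entity_group": "O"},
    {"word": "ẹyin", "entity_group": "PERIOD"},
]
print(restore(example))  # Mo ra ẹran, àti ẹyin.
```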

Training Details

  • Fine-tuning approach: Token classification head on top of Yoruba-adapted multilingual BERT
  • Dataset: abnuel/yor_punctuation
  • Dataset size: 1M–10M tokens
  • Reference: Fine-tuning approach informed by Punctuation Restoration using Transformer Models

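For training, punctuated corpus text has to be converted into (word, label) pairs, with each word labeled by the mark that immediately follows it. The sketch below shows one common scheme for this conversion; the dataset's actual preprocessing may differ (e.g., in how it handles quotes or multiple trailing marks).

```python
PUNCT2LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def to_token_labels(punctuated: str):
    """Split punctuated text into (word, label) pairs for sequence labeling.

    Each word gets the label of the punctuation mark attached to its end,
    or 'O' if none. This is one common scheme, not necessarily the exact
    preprocessing used for the abnuel/yor_punctuation dataset.
    """
    pairs = []
    for raw in punctuated.split():
        if raw and raw[-1] in PUNCT2LABEL:
            pairs.append((raw[:-1], PUNCT2LABEL[raw[-1]]))
        else:
            pairs.append((raw, "O"))
    return pairs

print(to_token_labels("Rárá, o dáa."))
```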
Limitations

  • Performance may be lower on highly dialectal Yoruba or heavily code-switched text.
  • Tonal diacritics must be present for best results; the model was not specifically trained for diacritic restoration.
  • Not evaluated on formal literary Yoruba vs. informal/social media registers.
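Since missing tonal diacritics degrade results, it can be worth checking whether input text carries them before running the model. A minimal heuristic using the standard library: after NFD normalization, diacritics decompose into combining marks, which unicodedata.combining detects.

```python
import unicodedata

def has_diacritics(text: str) -> bool:
    """Heuristic: True if the text contains any combining marks after NFD
    normalization (Yoruba tonal accents decompose into combining characters).
    Note: dot-below vowels like ọ/ẹ also decompose to combining marks, so
    this flags sub-dots as well as tone marks."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", text))

print(has_diacritics("Mo lọ sí ọjà"))  # True  -- accents and sub-dots present
print(has_diacritics("Mo lo si oja"))  # False -- stripped, plain ASCII text
```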

Citation

@misc{adegunlehin2025yoruba-punct,
  author = {Abayomi Adegunlehin},
  title  = {Yoruba Punctuation Restoration Model},
  year   = {2025},
  url    = {https://huggingface.co/abnuel/yoruba_task1_punctuation_model}
}