Corpus of Multilingual Medieval Archives

CoMMA: Corpus of Multilingual Medieval Archives

CoMMA is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.

Original paper:

@unpublished{clerice:hal-05299220,
  TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}},
  AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t},
  URL = {https://inria.hal.science/hal-05299220},
  NOTE = {working paper or preprint},
  YEAR = {2025},
  MONTH = Oct,
  KEYWORDS = {Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus},
  PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf},
  HAL_ID = {hal-05299220},
  HAL_VERSION = {v1},
}

🏛️ What’s Inside

Multilingual Corpora: Annotated texts in Old French and Medieval Latin.
Specialized Models: Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.

Demos:

Browse the corpus at https://comma.inria.fr
Try our normalization model at https://huggingface.co/spaces/comma-project/pre-editorial-normalization

License: CC-BY 4.0

spaces 2

Pre Editorial Normalization: Latin and Old French normalization of CATMuS output

models 4

comma-project/normalization-byt5-small

comma-project/fasttext

comma-project/modernbert

comma-project/modernbert-sentembeddings

datasets 4

comma-project/pen-alignment-pairs

comma-project/deep-jsonl

comma-project/comma-jsonl

comma-project/comma-other-formats
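
For readers who download files from one of the JSONL datasets above, the following is a minimal sketch of reading them with the Python standard library. It assumes the files follow the JSON Lines convention (one JSON object per line), as the repository names suggest. The function name `iter_jsonl` and the field names `id` and `text` in the sample are hypothetical and are not taken from the CoMMA datasets; check the dataset cards for the actual schema.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number}: invalid JSON ({err.msg})") from err
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield record


# Hypothetical sample: the real field names may differ.
SAMPLE = (
    '{"id": "doc-1", "text": "In nomine Domini"}\n'
    "\n"
    '{"id": "doc-2", "text": "Ci comence li romanz"}\n'
)

records = list(iter_jsonl(io.StringIO(SAMPLE)))
for record in records:
    print(record["id"], len(record["text"]))
```

To read a downloaded file instead of the in-memory sample, pass an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to a local copy of a dataset file.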
