Tori v2 — Name Splitter

ModernBERT-base (149M params) fine-tuned for splitting full name strings into forenames and surnames using BIO token classification.

Evaluation Results

| Variant            | F1     | Precision | Recall | Eval Dataset         |
|--------------------|--------|-----------|--------|----------------------|
| bilineal (default) | 0.9948 | 0.9948    | 0.9949 | MX/CO/ES/PE/CL names |
| unilineal          | 0.9927 | 0.9927    | 0.9927 | AR/US/BR/PT names    |

Metrics are entity-level (seqeval) — a name span is only correct if all its tokens match.
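
For intuition, here is how that scoring behaves when a prediction breaks a gold span. This is a minimal sketch using the seqeval library; the tag sequences are invented for illustration:

# Entity-level scoring with seqeval: a span counts only if every token in it matches.
from seqeval.metrics import precision_score, recall_score, f1_score

# Gold tags for a double-surname name, e.g. "Juan Carlos García López".
y_true = [["B-forenames", "I-forenames", "B-surnames", "I-surnames"]]

# A prediction that splits the surname span in two ("López" tagged as a new span).
y_pred = [["B-forenames", "I-forenames", "B-surnames", "B-surnames"]]

# The forename entity matches exactly; the gold surname entity does not,
# so it counts as one false negative plus two false positives.
print(precision_score(y_true, y_pred))  # ~0.33
print(recall_score(y_true, y_pred))     # 0.5
print(f1_score(y_true, y_pred))         # 0.4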

Variants

| Variant            | Countries          | Surname Pattern                      | Subfolder  |
|--------------------|--------------------|--------------------------------------|------------|
| bilineal (default) | MX, CO, ES, PE, CL | Double surname (paternal + maternal) | / (root)   |
| unilineal          | AR, US, BR, PT     | Single surname                       | unilineal/ |
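
If you want to bypass the tori package and load a variant straight from transformers, a hedged sketch follows. It assumes the unilineal/ subfolder holds a complete fine-tuned checkpoint, as the table above suggests; adjust paths if the tokenizer files live only at the repo root:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# The repo root holds the default bilineal checkpoint.
tokenizer = AutoTokenizer.from_pretrained("ittailup/tori2")
bilineal = AutoModelForTokenClassification.from_pretrained("ittailup/tori2")

# The unilineal variant is loaded via the subfolder argument (assumption: the
# subfolder contains full model weights and config).
unilineal = AutoModelForTokenClassification.from_pretrained(
    "ittailup/tori2", subfolder="unilineal"
)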

Usage

from tori.inference import load_pipeline, split_name

# Default: bilineal model (double-surname countries)
pipe = load_pipeline("ittailup/tori2")
result = split_name(pipe, "Juan Carlos García López")
print(result.forenames)  # ['Juan', 'Carlos']
print(result.surnames)   # ['García', 'López']

# Unilineal model (single-surname countries)
pipe = load_pipeline("ittailup/tori2", variant="unilineal")
result = split_name(pipe, "John Michael Smith")
print(result.forenames)  # ['John', 'Michael']
print(result.surnames)   # ['Smith']

Labels

  • O — Outside any name entity
  • B-forenames — Beginning of forename
  • I-forenames — Inside forename (continuation)
  • B-surnames — Beginning of surname
  • I-surnames — Inside surname (continuation)
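
Concretely, a word-level view of the tagging for a double-surname name looks like this (illustrative only; the model actually assigns these labels to BPE subword pieces, as noted in the next section):

# Word-level illustration of the BIO scheme for "Juan Carlos García López".
# The model itself tags ModernBERT BPE subword tokens, not whole words.
words = ["Juan", "Carlos", "García", "López"]
tags  = ["B-forenames", "I-forenames", "B-surnames", "I-surnames"]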

Important: Custom Aggregation Required

This model uses ModernBERT's GPT-style BPE tokenizer (Ġ prefix), which is not compatible with Hugging Face's built-in aggregation_strategy="simple". Use the tori.inference module, which handles subword aggregation correctly, or use aggregation_strategy="none" and aggregate the tokens yourself from their character offsets.
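
For the second option, a minimal sketch is shown below. It buckets raw subword predictions into the whitespace-delimited words of the input using character offsets, then routes each word by the label of its first subword. This is illustrative only, not the tori.inference implementation, and it assumes the pipeline returns start/end offsets (i.e. a fast tokenizer):

import re
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="ittailup/tori2",
    aggregation_strategy="none",
)

def split_name_manual(text: str):
    preds = pipe(text)

    # Whitespace-delimited words of the input and their character spans.
    spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]

    # Assign each word the label of the first subword prediction overlapping it.
    # The interval test tolerates offsets that do or do not include the leading space.
    labels = {}
    for p in preds:
        for i, (ws, we, _) in enumerate(spans):
            if p["start"] < we and p["end"] > ws:
                labels.setdefault(i, p["entity"])
                break

    forenames, surnames = [], []
    for i, (_, _, word) in enumerate(spans):
        label = labels.get(i, "O")
        if label.endswith("forenames"):
            forenames.append(word)
        elif label.endswith("surnames"):
            surnames.append(word)
    return forenames, surnames

print(split_name_manual("Juan Carlos García López"))
# Expected with the bilineal model: (['Juan', 'Carlos'], ['García', 'López'])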

Training

  • Base model: answerdotai/ModernBERT-base (149M params)
  • Training data: philipperemy/name-dataset, Mexico SEP, RENAPER (AR), datos.gob.ar
  • Effective batch size: 256 (128 per device x 2 gradient accumulation steps)
  • Learning rate: 5e-5, cosine schedule with 10% warmup
  • Epochs: 3
  • Precision: bf16
  • Hardware: NVIDIA A10G (AWS g5.xlarge)
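
The hyperparameters above map onto a transformers TrainingArguments configuration roughly like the following. This is an illustrative sketch, not the authors' actual training script; output_dir and any omitted options are placeholders:

# Illustrative reconstruction of the listed hyperparameters as TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="tori2-bilineal",       # placeholder path
    per_device_train_batch_size=128,   # 128 x 2 accumulation = 256 effective
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # 10% warmup
    num_train_epochs=3,
    bf16=True,
)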