# Text Register FastText Classifier
A FastText classifier that detects the communicative register (text type) of any English text at ~500k predictions/sec on CPU.
## Labels

| Code | Register | Description | Example |
|---|---|---|---|
| IN | Informational | Factual, encyclopedic, descriptive | Wikipedia articles, reports |
| NA | Narrative | Story-like, temporal sequence of events | News stories, fiction, blog posts |
| OP | Opinion | Subjective evaluation, personal views | Reviews, editorials, comments |
| IP | Persuasion | Attempts to convince or sell | Marketing copy, ads, fundraising |
| HI | HowTo | Instructions, procedures, recipes | Tutorials, manuals, FAQs |
| ID | Discussion | Interactive, forum-style dialogue | Forum threads, Q&A, comments |
| SP | Spoken | Transcribed or spoken-style text | Interviews, podcasts, speeches |
| LY | Lyrical | Poetic, artistic, song-like | Poetry, song lyrics, creative prose |
Based on the Biber & Egbert (2018) register taxonomy. Multi-label classification is supported (a text can be, e.g., both Informational and Narrative).
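For post-processing predictions, the eight codes above can be mapped back to readable names. A minimal sketch (`REGISTERS` and `label_to_name` are illustrative names, not part of the repo; `__label__` is the prefix fastText prepends to labels):

```python
# Map fastText label codes to human-readable register names
REGISTERS = {
    "IN": "Informational",
    "NA": "Narrative",
    "OP": "Opinion",
    "IP": "Persuasion",
    "HI": "HowTo",
    "ID": "Discussion",
    "SP": "Spoken",
    "LY": "Lyrical",
}

def label_to_name(label: str) -> str:
    """Turn a raw fastText label like '__label__IP' into 'Persuasion'."""
    return REGISTERS[label.removeprefix("__label__")]

label_to_name("__label__IP")  # -> 'Persuasion'
```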
## Quick Start

```python
import fasttext
from huggingface_hub import hf_hub_download

# Download model (quantized, 151 MB)
model_path = hf_hub_download(
    "oneryalcin/text-register-fasttext-classifier",
    "register_fasttext_q.bin",
)
model = fasttext.load_model(model_path)

# Predict
labels, probs = model.predict("Buy now and save 50%! Limited time offer!", k=3)
# -> [('__label__IP', 1.0), ...]  # IP = Persuasion
```

> **Note:** If you get a numpy error, pin numpy below 2: `pip install "numpy<2"`
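Because the model is multi-label, a top-1 prediction can miss secondary registers. A sketch of a helper that keeps every label above a probability cutoff (`registers_above` and the 0.3 threshold are illustrative; `model.predict` returns parallel label/probability sequences):

```python
def registers_above(labels, probs, threshold=0.3):
    """Filter fastText (labels, probs) output down to confident registers."""
    return [
        (label.removeprefix("__label__"), float(p))
        for label, p in zip(labels, probs)
        if p >= threshold
    ]

# With the shapes fastText returns, e.g. labels, probs = model.predict(text, k=3):
registers_above(("__label__IP", "__label__OP"), (0.97, 0.12))
# -> [('IP', 0.97)]
```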
## Performance
Trained on 10 English shards from TurkuNLP/register_oscar (~1.9M documents), balanced via oversampling/undersampling to median class size.
### Overall Metrics
| Metric | Full Model | Quantized |
|---|---|---|
| Precision@1 | 0.831 | 0.796 |
| Recall@1 | 0.759 | 0.727 |
| Precision@2 | 0.491 | — |
| Recall@2 | 0.898 | — |
| Speed | ~500k pred/s | ~500k pred/s |
| Size | 1.1 GB | 151 MB |
### Per-Class F1 (threshold=0.3, k=2)
| Register | Precision | Recall | F1 | Test Support |
|---|---|---|---|---|
| Informational | 0.910 | 0.666 | 0.769 | 108,672 |
| Narrative | 0.764 | 0.766 | 0.765 | 44,238 |
| Discussion | 0.640 | 0.774 | 0.701 | 7,420 |
| Persuasion | 0.553 | 0.794 | 0.652 | 19,193 |
| Opinion | 0.567 | 0.736 | 0.640 | 20,014 |
| HowTo | 0.515 | 0.766 | 0.616 | 7,281 |
| Spoken | 0.551 | 0.513 | 0.531 | 831 |
| Lyrical | 0.657 | 0.442 | 0.529 | 251 |
## Example Predictions

- "The company reported revenue of $4.2 billion..." -> Informational (1.00), Narrative (0.99)
- "Once upon a time in a small village..." -> Narrative
- "I honestly think this movie is terrible..." -> Opinion (1.00)
- "To install the package, first run pip install..." -> HowTo (1.00)
- "Buy now and save 50%! Limited time offer..." -> Persuasion (1.00)
- "So like, I was telling her yesterday..." -> Spoken (1.00)
- "I've been walking these streets alone..." -> Lyrical (1.00)
- "Hey everyone! What do you think about..." -> Discussion (1.00)
- "Introducing the revolutionary SkinGlow Pro..." -> Persuasion (1.00)
## Use Cases
- Data curation: Filter pretraining corpora by register (e.g., keep only Informational + HowTo)
- Content routing: Route incoming text to different processing pipelines
- Boilerplate removal: Flag Persuasion/Marketing text in document corpora
- Signal extraction: Identify which paragraphs in a document carry factual vs opinion content
- RAG preprocessing: Score chunks by register before feeding to LLMs
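The data-curation use case reduces to a keep/drop decision per document. A minimal sketch of that logic, assuming you already have the model's (labels, probs) output for each document (`keep_document`, the wanted set, and the 0.3 threshold are illustrative):

```python
def keep_document(labels, probs, wanted=frozenset({"IN", "HI"}), threshold=0.3):
    """Keep a document if any confident label falls in the wanted register set."""
    for label, p in zip(labels, probs):
        if p >= threshold and label.removeprefix("__label__") in wanted:
            return True
    return False

# An Informational document passes; a purely Persuasive one is dropped:
keep_document(("__label__IN",), (0.92,))  # -> True
keep_document(("__label__IP",), (0.99,))  # -> False
```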
## Reproduce from Scratch

### 1. Download data

```shell
pip install huggingface_hub

# Download 10 English shards (~4 GB)
for i in $(seq 0 9); do
  hf download TurkuNLP/register_oscar \
    $(printf "en/en_%05d.jsonl.gz" $i) \
    --repo-type dataset --local-dir ./data
done
```

### 2. Prepare balanced training data

```shell
python scripts/prepare_data.py --data-dir ./data/en --output-dir ./prepared
```

### 3. Train

```shell
pip install fasttext-wheel "numpy<2"
python scripts/train.py --train ./prepared/train.txt --test ./prepared/test.txt --output ./model
```

### 4. Predict

```shell
# Interactive
python scripts/predict.py --model ./model/register_fasttext_q.bin

# Single text
python scripts/predict.py --model ./model/register_fasttext_q.bin --text "Buy now! 50% off!"

# Batch
python scripts/predict.py --model ./model/register_fasttext_q.bin --input texts.txt --output out.jsonl
```
## Training Details
- Source data: TurkuNLP/register_oscar (English, 10 shards, ~1.9M labeled documents)
- Balancing: Minority classes oversampled, majority classes undersampled to median class size (~129k per class)
- Architecture: FastText supervised with bigrams, 100-dim embeddings, one-vs-all loss
- Hyperparameters: lr=0.5, epoch=25, wordNgrams=2, dim=100, loss=ova, bucket=2M
- Text preprocessing: Whitespace collapsed, truncated to 500 words
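The preprocessing step above (whitespace collapse plus 500-word truncation) fits in a few lines. A sketch under the stated assumptions; `clean_text` is an illustrative name, not the repo's actual function:

```python
import re

def clean_text(text: str, max_words: int = 500) -> str:
    """Collapse runs of whitespace and truncate to max_words, as done before training."""
    words = re.sub(r"\s+", " ", text).strip().split(" ")
    return " ".join(words[:max_words])

clean_text("hello   world\n\tfoo")  # -> 'hello world foo'
```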
## Limitations
- Spoken & Lyrical classes have lower F1 (~0.53) due to limited unique training data even after oversampling
- Trained on web text only — may not generalize well to domain-specific text (legal, medical)
- Bag-of-words model — does not understand word order or deep semantics
- English only (the source dataset has other languages that could be used for multilingual training)
## Citation

If you use this model, please cite the source dataset:

```bibtex
@inproceedings{register_oscar,
  title={Multilingual register classification on the full OSCAR data},
  author={R{\"o}nnqvist, Samuel and others},
  year={2023},
  note={TurkuNLP, University of Turku}
}

@article{biber2018register,
  title={Register as a predictor of linguistic variation},
  author={Biber, Douglas and Egbert, Jesse},
  journal={Corpus Linguistics and Linguistic Theory},
  year={2018}
}
```
## License
The model weights inherit the license of the source dataset (TurkuNLP/register_oscar). Scripts are released under MIT.