Text Register FastText Classifier

A FastText classifier that detects the communicative register (text type) of English text at ~500k predictions/sec on CPU.

Labels

| Code | Register | Description | Example |
|------|----------|-------------|---------|
| IN | Informational | Factual, encyclopedic, descriptive | Wikipedia articles, reports |
| NA | Narrative | Story-like, temporal sequence of events | News stories, fiction, blog posts |
| OP | Opinion | Subjective evaluation, personal views | Reviews, editorials, comments |
| IP | Persuasion | Attempts to convince or sell | Marketing copy, ads, fundraising |
| HI | HowTo | Instructions, procedures, recipes | Tutorials, manuals, FAQs |
| ID | Discussion | Interactive, forum-style dialogue | Forum threads, Q&A, comments |
| SP | Spoken | Transcribed or spoken-style text | Interviews, podcasts, speeches |
| LY | Lyrical | Poetic, artistic, song-like | Poetry, song lyrics, creative prose |

Based on the Biber & Egbert (2018) register taxonomy. Multi-label classification is supported: a text can be, for example, both Informational and Narrative.

Quick Start

import fasttext
from huggingface_hub import hf_hub_download

# Download model (quantized, 151 MB)
model_path = hf_hub_download(
    "oneryalcin/text-register-fasttext-classifier",
    "register_fasttext_q.bin"
)
model = fasttext.load_model(model_path)

# Predict
labels, probs = model.predict("Buy now and save 50%! Limited time offer!", k=3)
# labels -> ('__label__IP', ...), probs -> array([1.0, ...])  # IP = Persuasion

Note: If you get a numpy error when loading the model, pin numpy below 2: pip install "numpy<2"
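Because the model is multi-label, a top-1 prediction can miss secondary registers. A minimal post-processing sketch (the `multi_label` helper is illustrative, not part of the repo) that keeps every label above a probability threshold, assuming fastText's usual `(labels, probs)` return shape:

```python
def multi_label(labels, probs, threshold=0.3):
    """Keep every register whose probability clears the threshold.

    `labels` is a tuple like ('__label__IN', '__label__NA', ...) and
    `probs` the matching sequence of probabilities, as returned by
    fastText's model.predict(text, k=-1).
    """
    return [
        (label.replace("__label__", ""), float(p))
        for label, p in zip(labels, probs)
        if p >= threshold
    ]

# Example with mocked predict() output:
labels = ("__label__IN", "__label__NA", "__label__OP")
probs = (0.92, 0.41, 0.05)
print(multi_label(labels, probs))  # [('IN', 0.92), ('NA', 0.41)]
```

With the real model, call model.predict(text, k=-1) to get all eight labels, then filter; the per-class metrics in this card were computed with threshold=0.3.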

Performance

Trained on 10 English shards from TurkuNLP/register_oscar (~1.9M documents), balanced via oversampling/undersampling to median class size.

Overall Metrics

| Metric | Full Model | Quantized |
|--------|------------|-----------|
| Precision@1 | 0.831 | 0.796 |
| Recall@1 | 0.759 | 0.727 |
| Precision@2 | 0.491 | — |
| Recall@2 | 0.898 | — |
| Speed | ~500k pred/s | ~500k pred/s |
| Size | 1.1 GB | 151 MB |

Per-Class F1 (threshold=0.3, k=2)

| Register | Precision | Recall | F1 | Test Support |
|----------|-----------|--------|-----|--------------|
| Informational | 0.910 | 0.666 | 0.769 | 108,672 |
| Narrative | 0.764 | 0.766 | 0.765 | 44,238 |
| Discussion | 0.640 | 0.774 | 0.701 | 7,420 |
| Persuasion | 0.553 | 0.794 | 0.652 | 19,193 |
| Opinion | 0.567 | 0.736 | 0.640 | 20,014 |
| HowTo | 0.515 | 0.766 | 0.616 | 7,281 |
| Spoken | 0.551 | 0.513 | 0.531 | 831 |
| Lyrical | 0.657 | 0.442 | 0.529 | 251 |

Example Predictions

"The company reported revenue of $4.2 billion..."       -> Informational (1.00), Narrative (0.99)
"Once upon a time in a small village..."                -> Narrative
"I honestly think this movie is terrible..."            -> Opinion (1.00)
"To install the package, first run pip install..."      -> HowTo (1.00)
"Buy now and save 50%! Limited time offer..."           -> Persuasion (1.00)
"So like, I was telling her yesterday..."               -> Spoken (1.00)
"I've been walking these streets alone..."              -> Lyrical (1.00)
"Hey everyone! What do you think about..."              -> Discussion (1.00)
"Introducing the revolutionary SkinGlow Pro..."         -> Persuasion (1.00)

Use Cases

  • Data curation: Filter pretraining corpora by register (e.g., keep only Informational + HowTo)
  • Content routing: Route incoming text to different processing pipelines
  • Boilerplate removal: Flag Persuasion/Marketing text in document corpora
  • Signal extraction: Identify which paragraphs in a document carry factual vs opinion content
  • RAG preprocessing: Score chunks by register before feeding to LLMs
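As an illustration of the data-curation use case, here is a sketch (function and variable names are ours, not part of the repo) that keeps only documents whose predicted register falls in an allow-list, given fastText-style predictions:

```python
KEEP = {"IN", "HI"}  # Informational + HowTo, per the curation example above

def keep_document(labels, probs, allowed=KEEP, threshold=0.5):
    """Return True if any allowed register clears the threshold."""
    for label, p in zip(labels, probs):
        if label.replace("__label__", "") in allowed and p >= threshold:
            return True
    return False

# Mocked predictions for two documents:
print(keep_document(("__label__IN",), (0.97,)))   # True  -> keep
print(keep_document(("__label__IP",), (0.99,)))   # False -> drop (Persuasion)
```

The same predicate works for boilerplate removal by inverting it with an allow-list of {"IP"}.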

Reproduce from Scratch

1. Download data

pip install huggingface_hub

# Download 10 English shards (~4 GB)
for i in $(seq 0 9); do
    hf download TurkuNLP/register_oscar \
        $(printf "en/en_%05d.jsonl.gz" $i) \
        --repo-type dataset --local-dir ./data
done
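If you prefer to stay in Python, the same shards can be fetched with huggingface_hub; this is a hedged equivalent of the shell loop above (the helper names are ours):

```python
def shard_name(i):
    """Shard filenames follow the en/en_%05d.jsonl.gz pattern."""
    return f"en/en_{i:05d}.jsonl.gz"

def download_shards(n=10, local_dir="./data"):
    """Python equivalent of the shell loop above (performs network calls)."""
    from huggingface_hub import hf_hub_download
    for i in range(n):
        hf_hub_download(
            "TurkuNLP/register_oscar",
            shard_name(i),
            repo_type="dataset",
            local_dir=local_dir,
        )

print(shard_name(0))  # en/en_00000.jsonl.gz
```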

2. Prepare balanced training data

python scripts/prepare_data.py --data-dir ./data/en --output-dir ./prepared
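The exact output of prepare_data.py is not shown here, but fastText's supervised input format itself is fixed: one document per line, each label prefixed with __label__. A sketch of formatting one labeled document as a training line (the `to_fasttext_line` helper is illustrative):

```python
def to_fasttext_line(labels, text, max_words=500):
    """Format one document in fastText supervised format:
    '__label__IN __label__NA the document text ...'

    Whitespace is collapsed and the text truncated to max_words,
    matching the preprocessing described under Training Details.
    """
    words = text.split()[:max_words]
    tags = " ".join(f"__label__{l}" for l in labels)
    return f"{tags} {' '.join(words)}"

print(to_fasttext_line(["IN", "NA"], "The  company\nreported revenue."))
# __label__IN __label__NA The company reported revenue.
```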

3. Train

pip install fasttext-wheel "numpy<2"
python scripts/train.py --train ./prepared/train.txt --test ./prepared/test.txt --output ./model

4. Predict

# Interactive
python scripts/predict.py --model ./model/register_fasttext_q.bin

# Single text
python scripts/predict.py --model ./model/register_fasttext_q.bin --text "Buy now! 50% off!"

# Batch
python scripts/predict.py --model ./model/register_fasttext_q.bin --input texts.txt --output out.jsonl

Training Details

  • Source data: TurkuNLP/register_oscar (English, 10 shards, ~1.9M labeled documents)
  • Balancing: Minority classes oversampled, majority classes undersampled to median class size (~129k per class)
  • Architecture: FastText supervised with bigrams, 100-dim embeddings, one-vs-all loss
  • Hyperparameters: lr=0.5, epoch=25, wordNgrams=2, dim=100, loss=ova, bucket=2M
  • Text preprocessing: Whitespace collapsed, truncated to 500 words
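The preprocessing and training steps listed above can be sketched as follows; the internals of train.py are assumed, but the hyperparameters mirror the list:

```python
import re

def preprocess(text, max_words=500):
    """Collapse whitespace and truncate to max_words, as described above."""
    words = re.sub(r"\s+", " ", text).strip().split(" ")
    return " ".join(words[:max_words])

def train(train_path="./prepared/train.txt"):
    """Supervised fastText run with the listed hyperparameters (not executed here)."""
    import fasttext
    return fasttext.train_supervised(
        input=train_path,
        lr=0.5, epoch=25, wordNgrams=2, dim=100,
        loss="ova", bucket=2_000_000,
    )

print(preprocess("Buy   now\nand  save!"))  # Buy now and save!
```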

Limitations

  • Spoken & Lyrical classes have lower F1 (~0.53) due to limited unique training data even after oversampling
  • Trained on web text only — may not generalize well to domain-specific text (legal, medical)
  • Bag-of-words model — does not understand word order or deep semantics
  • English only (the source dataset has other languages that could be used for multilingual training)

Citation

If you use this model, please cite the source dataset:

@inproceedings{register_oscar,
  title={Multilingual register classification on the full OSCAR data},
  author={R{\"o}nnqvist, Samuel and others},
  year={2023},
  note={TurkuNLP, University of Turku}
}

@article{biber2018register,
  title={Register as a predictor of linguistic variation},
  author={Biber, Douglas and Egbert, Jesse},
  journal={Corpus Linguistics and Linguistic Theory},
  year={2018}
}

License

The model weights inherit the license of the source dataset (TurkuNLP/register_oscar). Scripts are released under MIT.
