ace-1/mgpt2-tokenizer
A pure-Python Byte-Pair Encoding tokenizer trained to better handle:
- English
- Hindi (Devanagari + transliterated Latin)
- Kannada (Kannada script + transliterated Latin)
This repo is meant to be used with `trust_remote_code=True`.
Quickstart
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('ace-1/mgpt2-tokenizer', trust_remote_code=True)

text = "Hello! नमस्ते! ನಮಸ್ಕಾರ! namaste! namaskara!"
ids = tok.encode(text)
print(len(ids), ids[:20])
print(tok.decode(ids))
```
Tokenizer spec
- Vocabulary size: 50,257 (matches GPT-2 exactly): 256 byte tokens + 50,000 merges + `<|endoftext|>`
- Special tokens: `<|endoftext|>`
- Implementation: custom pure-Python tokenizer under `tokenizer/` (loaded dynamically)
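As a rough illustration of the spec above, the sketch below applies byte-level BPE merges on top of the 256 base byte tokens. The merge table and `bpe_encode` helper are hypothetical toys for explanation only, not the repo's actual implementation or its trained merges.

```python
def bpe_encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Toy byte-level BPE: ids 0..255 are raw UTF-8 bytes; merged tokens start at 256."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        # Pick the adjacent pair whose merge was learned earliest
        # (minbpe convention: merge ids are assigned in training order).
        pairs = set(zip(ids, ids[1:]))
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no applicable merge left
        new_id = merges[best]
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Toy merge table: "he" (bytes 104,101) -> 256, then (256, "l"=108) -> 257.
toy_merges = {(104, 101): 256, (256, 108): 257}
print(bpe_encode("hello", toy_merges))  # [257, 108, 111]
```

The real tokenizer works the same way but with ~50,000 trained merges loaded from the native model file, so any UTF-8 input (Devanagari and Kannada included) always has a valid byte-level fallback.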
Training corpus (tokenizer)
The tokenizer was trained on a deterministic mixture built from:
- FineWeb‑Edu (English)
- AI4Bharat Sangraha synthetic splits: `hin_Deva`, `hin_Latn`, `kan_Knda`, `kan_Latn`
Evaluation
This repo includes evaluation.json with tokenizer-only metrics:
- tokens per 1k bytes (lower is better)
- p95 tokens per line (lower is better)
- bucket breakdown: latin / devanagari / kannada / mixed
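The metrics above can be sketched as follows. Both helpers are hypothetical re-implementations for illustration; the exact definitions behind `evaluation.json` may differ.

```python
import unicodedata

def tokens_per_1k_bytes(num_tokens: int, text: str) -> float:
    """Tokens emitted per 1,000 UTF-8 bytes of input (lower = more compact)."""
    n_bytes = len(text.encode("utf-8"))
    return 1000.0 * num_tokens / n_bytes

def script_bucket(text: str) -> str:
    """Bucket a line as latin / devanagari / kannada / mixed by the scripts of its letters."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("DEVANAGARI"):
            scripts.add("devanagari")
        elif name.startswith("KANNADA"):
            scripts.add("kannada")
        else:
            scripts.add("latin")
    if not scripts:
        return "latin"  # assumption: letter-free lines default to latin
    return scripts.pop() if len(scripts) == 1 else "mixed"

print(script_bucket("नमस्ते"))        # -> devanagari
print(script_bucket("Hello नमस्ते"))  # -> mixed
print(tokens_per_1k_bytes(5, "Hello world"))
```

Byte-based normalization matters here because Devanagari and Kannada characters take 3 UTF-8 bytes each, so per-character token counts would understate how compactly those scripts are encoded.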
Files
- `mgpt2.model` (minbpe-style `.model` file): the native trained artifact
- `tokenizer.vocab` / `tokenizer.model`: HF artifacts generated from the native model
- `tokenization_mgpt2.py`: root module entrypoint for `transformers` dynamic loading
Notes / limitations
- This is a slow tokenizer (pure Python). It is intended for research and reproducibility.
- Downstream LM metrics (perplexity, instruction following, DPO) are reported in the main mgpt2 project repo as controlled experiments vs a baseline GPT‑2 tokenizer/model.