ace-1/mgpt2-tokenizer
A pure-Python Byte-Pair Encoding tokenizer trained to better handle:
- English
- Hindi (Devanagari + transliterated Latin)
- Kannada (Kannada script + transliterated Latin)
This repo is meant to be used with `trust_remote_code=True`.
Quickstart
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('ace-1/mgpt2-tokenizer', trust_remote_code=True)

text = "Hello! नमस्ते! ನಮಸ್ಕಾರ! namaste! namaskara!"
ids = tok.encode(text)
print(len(ids), ids[:20])
print(tok.decode(ids))
```
Tokenizer spec
- Vocabulary size: 50,257 (matches GPT-2 exactly): 256 byte tokens + 50,000 merges + `<|endoftext|>`
- Special tokens: `<|endoftext|>`
- Implementation: custom pure-Python tokenizer under `tokenizer/` (loaded dynamically)
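As a rough illustration of the spec above, the sketch below applies byte-level BPE merges on top of the 256 base byte tokens. The merge table and `bpe_encode` helper are hypothetical toys for explanation only, not the repo's actual implementation or its trained merges.

```python
def bpe_encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Toy byte-level BPE: ids 0..255 are raw UTF-8 bytes; merged tokens start at 256."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        # Pick the adjacent pair whose merge was learned earliest
        # (minbpe convention: merge ids are assigned in training order).
        pairs = set(zip(ids, ids[1:]))
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no applicable merge left
        new_id = merges[best]
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Toy merge table: "he" (bytes 104,101) -> 256, then (256, "l"=108) -> 257.
toy_merges = {(104, 101): 256, (256, 108): 257}
print(bpe_encode("hello", toy_merges))  # [257, 108, 111]
```

The real tokenizer works the same way but with ~50,000 trained merges loaded from the native model file, so any UTF-8 input (Devanagari and Kannada included) always has a valid byte-level fallback.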
Training corpus (tokenizer)
The tokenizer was trained on a deterministic mixture built from:
- FineWeb‑Edu (English)
- AI4Bharat Sangraha synthetic splits: `hin_Deva`, `hin_Latn`, `kan_Knda`, `kan_Latn`
Evaluation
This repo includes evaluation.json with tokenizer-only metrics:
- tokens per 1k bytes (lower is better)
- p95 tokens per line (lower is better)
- bucket breakdown: latin / devanagari / kannada / mixed
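The metrics above can be sketched as follows. Both helpers are hypothetical re-implementations for illustration; the exact definitions behind `evaluation.json` may differ.

```python
import unicodedata

def tokens_per_1k_bytes(num_tokens: int, text: str) -> float:
    """Tokens emitted per 1,000 UTF-8 bytes of input (lower = more compact)."""
    n_bytes = len(text.encode("utf-8"))
    return 1000.0 * num_tokens / n_bytes

def script_bucket(text: str) -> str:
    """Bucket a line as latin / devanagari / kannada / mixed by the scripts of its letters."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("DEVANAGARI"):
            scripts.add("devanagari")
        elif name.startswith("KANNADA"):
            scripts.add("kannada")
        else:
            scripts.add("latin")
    if not scripts:
        return "latin"  # assumption: letter-free lines default to latin
    return scripts.pop() if len(scripts) == 1 else "mixed"

print(script_bucket("नमस्ते"))        # -> devanagari
print(script_bucket("Hello नमस्ते"))  # -> mixed
print(tokens_per_1k_bytes(5, "Hello world"))
```

Byte-based normalization matters here because Devanagari and Kannada characters take 3 UTF-8 bytes each, so per-character token counts would understate how compactly those scripts are encoded.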
Files
- `mgpt2.model` (minbpe-style `.model` file): the native trained artifact
- `tokenizer.vocab` / `tokenizer.model`: HF artifacts generated from the native model
- `tokenization_mgpt2.py`: root module entrypoint for `transformers` dynamic loading
Notes / limitations
- This is a slow tokenizer (pure Python). It is intended for research and reproducibility.
- Downstream LM metrics (perplexity, instruction following, DPO) are reported in the main mgpt2 project repo as controlled experiments vs a baseline GPT‑2 tokenizer/model.