
ace-1/mgpt2-tokenizer

A pure-Python Byte-Pair Encoding (BPE) tokenizer trained for improved handling of:

  • English
  • Hindi (Devanagari + transliterated Latin)
  • Kannada (Kannada script + transliterated Latin)

This repo is meant to be used with trust_remote_code=True.

Quickstart

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ace-1/mgpt2-tokenizer", trust_remote_code=True)
text = "Hello! नमस्ते! ನಮಸ್ಕಾರ! namaste! namaskara!"
ids = tok.encode(text)
print(len(ids), ids[:20])
print(tok.decode(ids))

Tokenizer spec

  • Vocabulary size: 50,257 (identical to GPT‑2)
    • 256 byte tokens + 50,000 merges + <|endoftext|>
  • Special tokens: <|endoftext|>
  • Implementation: custom Python tokenizer under tokenizer/ (loaded dynamically)
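The vocabulary arithmetic above can be checked directly: 256 byte-level tokens plus 50,000 learned merges plus one special token gives the GPT-2 size. A minimal sketch:

```python
# Vocabulary composition as stated in the spec above.
NUM_BYTE_TOKENS = 256          # one token per raw byte value (byte-level BPE base)
NUM_MERGES = 50_000            # learned BPE merge rules, one new token each
SPECIAL_TOKENS = ["<|endoftext|>"]

vocab_size = NUM_BYTE_TOKENS + NUM_MERGES + len(SPECIAL_TOKENS)
print(vocab_size)  # 50257
```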

Training corpus (tokenizer)

The tokenizer was trained on a deterministic mixture built from:

  • FineWeb‑Edu (English)
  • AI4Bharat Sangraha synthetic splits: hin_Deva, hin_Latn, kan_Knda, kan_Latn
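The exact mixing recipe and weights are defined in the mgpt2 project, not in this card; as an illustration only, one deterministic way to interleave named document sources (a simple round-robin, with the function name `deterministic_mixture` being a hypothetical sketch, not the project's actual code) is:

```python
def deterministic_mixture(sources):
    """Round-robin over named document lists in fixed insertion order,
    yielding (source_name, document) pairs until every source is empty.
    Hypothetical sketch; the real mixture recipe may weight sources."""
    iters = {name: iter(docs) for name, docs in sources.items()}
    while iters:
        for name in list(iters):        # copy keys so we can delete mid-loop
            try:
                yield name, next(iters[name])
            except StopIteration:
                del iters[name]         # source exhausted; drop it

sources = {
    "fineweb_edu": ["doc_en_1", "doc_en_2"],
    "hin_Deva": ["doc_hi_1"],
    "kan_Knda": ["doc_kn_1"],
}
print(list(deterministic_mixture(sources)))
```

Because iteration order and source contents are fixed, the resulting stream is fully reproducible, which is what "deterministic mixture" buys you for tokenizer training.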

Evaluation

This repo includes evaluation.json with tokenizer-only metrics:

  • tokens per 1k bytes (lower is better)
  • p95 tokens per line (lower is better)
  • bucket breakdown: latin / devanagari / kannada / mixed
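The two scalar metrics can be computed from any `str -> list[int]` encode function. The exact definitions used in evaluation.json may differ; a plausible implementation is:

```python
def tokens_per_1k_bytes(texts, encode):
    """Compression metric: tokens emitted per 1,000 UTF-8 bytes of input
    (lower is better). `encode` is any str -> sequence-of-tokens function."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return 1000 * total_tokens / total_bytes

def p95_tokens_per_line(lines, encode):
    """95th-percentile token count over individual lines (lower is better),
    using a simple nearest-rank estimate."""
    counts = sorted(len(encode(line)) for line in lines)
    idx = min(len(counts) - 1, int(0.95 * (len(counts) - 1)))
    return counts[idx]

# Toy stand-in for a real tokenizer: one token per whitespace-separated word.
toy_encode = lambda s: s.split()
print(round(tokens_per_1k_bytes(["hello world"], toy_encode), 1))  # → 181.8
```

Per-bucket numbers are the same metrics restricted to lines classified as latin, devanagari, kannada, or mixed script.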

Files

  • Native trained artifact: mgpt2.model (minbpe-style .model file)
  • tokenizer.vocab / tokenizer.model (HF artifacts generated from the native model)
  • tokenization_mgpt2.py (root module entrypoint for transformers dynamic loading)
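For reference, transformers resolves a root-module entrypoint like tokenization_mgpt2.py through an auto_map entry in tokenizer_config.json. A typical entry looks like the fragment below; the class name MGPT2Tokenizer is an assumption here, so check the repo's actual tokenizer_config.json for the real name.

```json
{
  "auto_map": {
    "AutoTokenizer": ["tokenization_mgpt2.MGPT2Tokenizer", null]
  },
  "tokenizer_class": "MGPT2Tokenizer"
}
```

This is why `trust_remote_code=True` is required: AutoTokenizer downloads and imports the referenced module from the repo instead of using a built-in class.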

Notes / limitations

  • This is a slow tokenizer (pure Python). It is intended for research and reproducibility.
  • Downstream LM metrics (perplexity, instruction following, DPO) are reported in the main mgpt2 project repo as controlled experiments vs a baseline GPT‑2 tokenizer/model.