Spaces:
Running
Running
metadata
title: TokenizerBench
emoji: ๐ค
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.11.0
app_file: app.py
pinned: false
license: mit
short_description: Evaluate & compare tokenizers
TokenizerBench
Evaluate any Hugging Face or tiktoken tokenizer against the TokenizerBench dataset โ covering multilingual text, programming languages, scientific formulas, and edge cases.
Features
- ๐งช Playground โ type any text and see live tokenization (token IDs, fertility, compression, fidelity check)
- ๐ Evaluate โ run a full evaluation on a single tokenizer with heatmap, language bar chart, and scatter plot
- โ๏ธ Compare โ compare two tokenizers side-by-side with grouped bar charts and a leaderboard
Dataset categories
| Category | Subcategories |
|---|---|
| ๐ Human languages | English, Hindi, Chinese, Arabic, Japanese, German, Russian, Korean |
| ๐ป Programming languages | Python, JavaScript, SQL, Rust |
| ๐งฎ Scientific formulas | Algebra, Calculus, Physics, Statistics |
| โ ๏ธ Edge cases | Whitespace, Long tokens, Mixed scripts |
Metrics
| Metric | Better | Notes |
|---|---|---|
avg_fertility |
Lower | Tokens per word. Near 1.0 = ideal. โฅ4 = poor. |
avg_compression_ratio |
Lower | Tokens per character. |
avg_byte_compression |
Lower | Tokens per UTF-8 byte. Language-agnostic. |
fidelity_pass_rate |
1.0 | Must be 1.0 โ any failure indicates a problem. |
Supported tokenizer types
- HuggingFace AutoTokenizer โ any model from the Hub, e.g.
bert-base-multilingual-cased,xlm-roberta-base,google/mt5-base - tiktoken โ OpenAI encodings:
cl100k_base,o200k_base,p50k_base