---
title: TokenizerBench
emoji: 🤗
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.11.0
app_file: app.py
pinned: false
license: mit
short_description: Evaluate & compare tokenizers
---

# TokenizerBench

Evaluate any Hugging Face or tiktoken tokenizer against the **TokenizerBench** dataset — covering multilingual text, programming languages, scientific formulas, and edge cases.

## Features

- **🧪 Playground** — type any text and see live tokenization (token IDs, fertility, compression, fidelity check)
- **📊 Evaluate** — run a full evaluation on a single tokenizer with a heatmap, language bar chart, and scatter plot
- **⚖️ Compare** — compare two tokenizers side-by-side with grouped bar charts and a leaderboard

## Dataset categories

| Category | Subcategories |
|----------|---------------|
| 🌍 Human languages | English, Hindi, Chinese, Arabic, Japanese, German, Russian, Korean |
| 💻 Programming languages | Python, JavaScript, SQL, Rust |
| 🧮 Scientific formulas | Algebra, Calculus, Physics, Statistics |
| ⚠️ Edge cases | Whitespace, Long tokens, Mixed scripts |

## Metrics

| Metric | Better | Notes |
|--------|--------|-------|
| `avg_fertility` | Lower | Tokens per word. Near 1.0 is ideal; ≥4 is poor. |
| `avg_compression_ratio` | Lower | Tokens per character. |
| `avg_byte_compression` | Lower | Tokens per UTF-8 byte. Language-agnostic. |
| `fidelity_pass_rate` | 1.0 | Must be 1.0 — any failure indicates a problem. |

## Supported tokenizer types

- **Hugging Face AutoTokenizer** — any model from the Hub, e.g. `bert-base-multilingual-cased`, `xlm-roberta-base`, `google/mt5-base`
- **tiktoken** — OpenAI encodings: `cl100k_base`, `o200k_base`, `p50k_base`
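
## How the metrics are defined

A minimal sketch of how metrics like these can be computed. The function names (`fertility`, `compression_ratio`, `byte_compression`, `fidelity`) are illustrative, not the app's actual API, and the whitespace tokenizer in the demo merely stands in for a real encoder:

```python
def fertility(tokens: list, text: str) -> float:
    """Tokens per whitespace-delimited word (lower is better, ~1.0 ideal)."""
    return len(tokens) / max(len(text.split()), 1)

def compression_ratio(tokens: list, text: str) -> float:
    """Tokens per character (lower is better)."""
    return len(tokens) / max(len(text), 1)

def byte_compression(tokens: list, text: str) -> float:
    """Tokens per UTF-8 byte; comparable across scripts because the byte
    count, unlike the character count, reflects encoding width."""
    return len(tokens) / max(len(text.encode("utf-8")), 1)

def fidelity(decoded: str, text: str) -> bool:
    """Round-trip check: decode(encode(text)) must reproduce the input."""
    return decoded == text

# Demo with a trivial whitespace tokenizer standing in for a real one:
text = "Hello tokenizer world"
tokens = text.split()
print(fertility(tokens, text))  # 1.0: exactly one token per word
```

With a real tokenizer, `tokens` would come from `AutoTokenizer.encode(...)` or `tiktoken`'s `encoding.encode(...)`, and `fidelity` would compare the corresponding `decode` output against the input.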