---
title: TokenizerBench
emoji: 🤗
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.11.0
app_file: app.py
pinned: false
license: mit
short_description: Evaluate & compare tokenizers
---
# TokenizerBench
Evaluate any Hugging Face or tiktoken tokenizer against the **TokenizerBench** dataset, covering multilingual text, programming languages, scientific formulas, and edge cases.
## Features
- **🧪 Playground** – type any text and see live tokenization (token IDs, fertility, compression, fidelity check)
- **📊 Evaluate** – run a full evaluation on a single tokenizer with heatmap, language bar chart, and scatter plot
- **⚖️ Compare** – compare two tokenizers side-by-side with grouped bar charts and a leaderboard
## Dataset categories
| Category | Subcategories |
|----------|--------------|
| 🌍 Human languages | English, Hindi, Chinese, Arabic, Japanese, German, Russian, Korean |
| 💻 Programming languages | Python, JavaScript, SQL, Rust |
| 🧮 Scientific formulas | Algebra, Calculus, Physics, Statistics |
| ⚠️ Edge cases | Whitespace, Long tokens, Mixed scripts |
## Metrics
| Metric | Better | Notes |
|--------|--------|-------|
| `avg_fertility` | Lower | Tokens per word. Near 1.0 = ideal. ≥ 4 = poor. |
| `avg_compression_ratio` | Lower | Tokens per character. |
| `avg_byte_compression` | Lower | Tokens per UTF-8 byte. Language-agnostic. |
| `fidelity_pass_rate` | 1.0 | Must be 1.0 – any failure indicates a problem. |
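As a minimal sketch of what these metrics measure (the function names here are illustrative, not necessarily those used in `app.py`): fertility counts tokens per whitespace-delimited word, the compression ratios count tokens per character and per UTF-8 byte, and fidelity checks that decoding the encoded text reproduces it exactly.

```python
def fertility(n_tokens: int, text: str) -> float:
    # Tokens per whitespace-delimited word; near 1.0 is ideal.
    return n_tokens / max(len(text.split()), 1)

def compression_ratio(n_tokens: int, text: str) -> float:
    # Tokens per character of input text.
    return n_tokens / max(len(text), 1)

def byte_compression(n_tokens: int, text: str) -> float:
    # Tokens per UTF-8 byte; language-agnostic, since scripts
    # differ in how many bytes each character takes.
    return n_tokens / max(len(text.encode("utf-8")), 1)

def fidelity_ok(text: str, decoded: str) -> bool:
    # Round-trip check: decode(encode(text)) must equal text.
    return decoded == text
```

Byte compression is the fairest cross-language comparison: a Chinese character is one "word-piece" of meaning but three UTF-8 bytes, so per-character ratios can flatter or penalize a tokenizer depending on script.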
## Supported tokenizer types
- **HuggingFace AutoTokenizer** – any model from the Hub, e.g. `bert-base-multilingual-cased`, `xlm-roberta-base`, `google/mt5-base`
- **tiktoken** – OpenAI encodings: `cl100k_base`, `o200k_base`, `p50k_base`
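A sketch of how the two backends can be dispatched behind one interface (this is an assumption about the design, not the Space's actual loader; `TIKTOKEN_ENCODINGS` and the function names are hypothetical):

```python
# Names treated as tiktoken encodings; everything else goes to the Hub.
TIKTOKEN_ENCODINGS = {"cl100k_base", "o200k_base", "p50k_base"}

def tokenizer_kind(name: str) -> str:
    """Classify a tokenizer name as 'tiktoken' or 'huggingface'."""
    return "tiktoken" if name in TIKTOKEN_ENCODINGS else "huggingface"

def load_tokenizer(name: str):
    """Return (encode, decode) callables for either backend."""
    if tokenizer_kind(name) == "tiktoken":
        import tiktoken
        enc = tiktoken.get_encoding(name)
        return enc.encode, enc.decode
    from transformers import AutoTokenizer
    tok = AutoTokenizer.from_pretrained(name)
    return tok.encode, tok.decode
```

Normalizing both backends to plain `encode`/`decode` callables keeps the metric code backend-agnostic; note that HF `encode` may add special tokens, which evaluation code typically needs to account for.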