---
title: TokenizerBench
emoji: 🤗
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.11.0
app_file: app.py
pinned: false
license: mit
short_description: Evaluate & compare tokenizers
---

# TokenizerBench

Evaluate any Hugging Face or tiktoken tokenizer against the **TokenizerBench** dataset — covering multilingual text, programming languages, scientific formulas, and edge cases.

## Features

- **🧪 Playground** — type any text and see live tokenization (token IDs, fertility, compression, fidelity check)
- **📊 Evaluate** — run a full evaluation on a single tokenizer with heatmap, language bar chart, and scatter plot
- **⚖️ Compare** — compare two tokenizers side-by-side with grouped bar charts and a leaderboard

## Dataset categories

| Category | Subcategories |
|----------|--------------|
| ๐ŸŒ Human languages | English, Hindi, Chinese, Arabic, Japanese, German, Russian, Korean |
| ๐Ÿ’ป Programming languages | Python, JavaScript, SQL, Rust |
| ๐Ÿงฎ Scientific formulas | Algebra, Calculus, Physics, Statistics |
| โš ๏ธ Edge cases | Whitespace, Long tokens, Mixed scripts |

## Metrics

| Metric | Better | Notes |
|--------|--------|-------|
| `avg_fertility` | Lower | Tokens per word. Near 1.0 = ideal. ≥4 = poor. |
| `avg_compression_ratio` | Lower | Tokens per character. |
| `avg_byte_compression` | Lower | Tokens per UTF-8 byte. Language-agnostic. |
| `fidelity_pass_rate` | 1.0 | Must be 1.0 — any failure indicates a problem. |
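For intuition, the four metrics can be sketched for a single sample as below. This is a minimal illustration, not TokenizerBench's actual implementation: the `encode`/`decode` callable interface, the whitespace word count, and the toy tokenizer are all assumptions made for the example.

```python
def evaluate_sample(text, encode, decode):
    """Compute the four per-sample metrics for one (hypothetical) tokenizer."""
    ids = encode(text)
    words = text.split() or [text]  # crude whitespace-based word count
    return {
        "avg_fertility": len(ids) / len(words),                         # tokens per word
        "avg_compression_ratio": len(ids) / len(text),                  # tokens per character
        "avg_byte_compression": len(ids) / len(text.encode("utf-8")),   # tokens per UTF-8 byte
        "fidelity_pass_rate": 1.0 if decode(ids) == text else 0.0,      # lossless round-trip?
    }

# Toy tokenizer for demonstration: one token per whitespace-separated word.
vocab = {}

def toy_encode(text):
    return [vocab.setdefault(w, len(vocab)) for w in text.split(" ")]

def toy_decode(ids):
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)
```

With this toy tokenizer, `evaluate_sample("hello world", toy_encode, toy_decode)` yields a fertility of exactly 1.0 and a fidelity of 1.0, since every word maps to one token and the round trip is lossless.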

## Supported tokenizer types

- **HuggingFace AutoTokenizer** — any model from the Hub, e.g. `bert-base-multilingual-cased`, `xlm-roberta-base`, `google/mt5-base`
- **tiktoken** — OpenAI encodings: `cl100k_base`, `o200k_base`, `p50k_base`
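A dispatcher over the two backends could look like the sketch below. The `"tiktoken:name"` spec format is an invented convention for this example (the app's actual input format may differ); `tiktoken.get_encoding` and `AutoTokenizer.from_pretrained` are the libraries' real entry points.

```python
def parse_spec(spec: str):
    """Split 'backend:name' into (backend, name); plain names default to 'hf'."""
    backend, sep, name = spec.partition(":")
    if not sep:
        return "hf", spec
    return backend, name

def load_tokenizer(spec: str):
    """Return (encode, decode) callables for either backend (imports are lazy)."""
    backend, name = parse_spec(spec)
    if backend == "tiktoken":
        import tiktoken  # OpenAI encodings, e.g. cl100k_base
        enc = tiktoken.get_encoding(name)
        return enc.encode, enc.decode
    from transformers import AutoTokenizer  # any Hub model
    tok = AutoTokenizer.from_pretrained(name)
    return tok.encode, tok.decode
```

For example, `load_tokenizer("tiktoken:cl100k_base")` and `load_tokenizer("xlm-roberta-base")` would both hand back a uniform encode/decode pair.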