---
title: TokenizerBench
emoji: 🤗
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.11.0
app_file: app.py
pinned: false
license: mit
short_description: Evaluate & compare tokenizers
---

# TokenizerBench

Evaluate any Hugging Face or tiktoken tokenizer against the TokenizerBench dataset, covering multilingual text, programming languages, scientific formulas, and edge cases.

## Features

- 🧪 **Playground**: type any text and see live tokenization (token IDs, fertility, compression, fidelity check)
- 📊 **Evaluate**: run a full evaluation of a single tokenizer, with a heatmap, a per-language bar chart, and a scatter plot
- ⚖️ **Compare**: compare two tokenizers side by side with grouped bar charts and a leaderboard

## Dataset categories

| Category | Subcategories |
| --- | --- |
| 🌍 Human languages | English, Hindi, Chinese, Arabic, Japanese, German, Russian, Korean |
| 💻 Programming languages | Python, JavaScript, SQL, Rust |
| 🧮 Scientific formulas | Algebra, Calculus, Physics, Statistics |
| ⚠️ Edge cases | Whitespace, Long tokens, Mixed scripts |

## Metrics

| Metric | Better | Notes |
| --- | --- | --- |
| `avg_fertility` | Lower | Tokens per word. Near 1.0 is ideal; ≥ 4 is poor. |
| `avg_compression_ratio` | Lower | Tokens per character. |
| `avg_byte_compression` | Lower | Tokens per UTF-8 byte. Language-agnostic. |
| `fidelity_pass_rate` | 1.0 | Must be exactly 1.0; any failure means the tokenizer does not round-trip text losslessly. |
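All four metrics can be computed for anything that exposes encode/decode. A minimal sketch, not the app's actual code; the `tokenizer_metrics` helper and the toy whitespace tokenizer below are illustrative assumptions:

```python
def tokenizer_metrics(text, encode, decode):
    """Compute the metrics above for one text sample.

    `encode` maps text -> token ids; `decode` maps ids -> text.
    """
    ids = encode(text)
    words = text.split() or [text]
    return {
        "fertility": len(ids) / len(words),                        # tokens per word
        "compression_ratio": len(ids) / len(text),                 # tokens per character
        "byte_compression": len(ids) / len(text.encode("utf-8")),  # tokens per UTF-8 byte
        "fidelity": decode(ids) == text,                           # lossless round-trip?
    }

# Toy tokenizer for illustration: one token per whitespace-separated word.
vocab = {}

def toy_encode(text):
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

def toy_decode(ids):
    rev = {i: w for w, i in vocab.items()}
    return " ".join(rev[i] for i in ids)

m = tokenizer_metrics("evaluate and compare tokenizers", toy_encode, toy_decode)
```

A real subword tokenizer would typically report fertility above 1.0 on non-English text, which is exactly what the per-language charts surface.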

## Supported tokenizer types

- **HuggingFace AutoTokenizer**: any model from the Hub, e.g. `bert-base-multilingual-cased`, `xlm-roberta-base`, `google/mt5-base`
- **tiktoken**: OpenAI encodings: `cl100k_base`, `o200k_base`, `p50k_base`