thekusaldarshana committed · verified · Commit b3a398c · 1 parent: e51bea7

Update README.md

Files changed (1): README.md (+24 −24)
README.md CHANGED
```diff
@@ -78,38 +78,38 @@ Sinhala and Devanagari serve as the high-complexity proofs-of-concept. The same
 We evaluated WWHO against frontier models across a 1.5 million sentence code-switched corpus containing Sinhala, Hindi (Devanagari), and English.
 
 ### 1. Sinhala Efficiency
-|Tokenizer | Tokens | TWR | Chr/Tok | % Reduction
-|----------------------------------------------------------------------
-|**SGPE(WWHO) | 6,654,288 | 1.274 | 4.83 | -**
-|OpenAI (o200k_base) | 17,360,196 | 3.324 | 1.85 | 61.7%
-|Llama 4 Scout | 18,157,707 | 3.476 | 1.77 | 63.4%
-|DeepSeek V3 | 29,152,698 | 5.581 | 1.10 | 77.2%
+| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
+|---|---|---|---|---|
+| **SGPE(WWHO)** | **6,654,288** | **1.274** | **4.83** | **-** |
+| OpenAI (o200k_base) | 17,360,196 | 3.324 | 1.85 | 61.7% |
+| Llama 4 Scout | 18,157,707 | 3.476 | 1.77 | 63.4% |
+| DeepSeek V3 | 29,152,698 | 5.581 | 1.10 | 77.2% |
 
 ### 2. Hindi (Devanagari) Efficiency
-|Tokenizer | Tokens | TWR | Chr/Tok | % Reduction
-|----------------------------------------------------------------------
-|**SGPE(WWHO) | 13,433,554 | 1.181 | 4.29 | -**
-|OpenAI (o200k_base) | 18,394,075 | 1.617 | 3.13 | 27.0%
-|Llama 4 Scout | 19,566,121 | 1.720 | 2.94 | 31.3%
-|DeepSeek V3 | 31,682,218 | 2.786 | 1.82 | 57.6%
+| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
+|---|---|---|---|---|
+| **SGPE(WWHO)** | **13,433,554** | **1.181** | **4.29** | **-** |
+| OpenAI (o200k_base) | 18,394,075 | 1.617 | 3.13 | 27.0% |
+| Llama 4 Scout | 19,566,121 | 1.720 | 2.94 | 31.3% |
+| DeepSeek V3 | 31,682,218 | 2.786 | 1.82 | 57.6% |
 
 ### 3. English
-|Tokenizer | Tokens | TWR | Chr/Tok | % Reduction
-|----------------------------------------------------------------------
-|**SGPE(WWHO) | 7,240,147 | 1.330 | 4.46 | -**
-|OpenAI (o200k_base) | 7,420,527 | 1.364 | 4.35 | 2.4%
-|Llama 4 Scout | 7,512,843 | 1.381 | 4.30 | 3.6%
-|DeepSeek V3 | 7,904,670 | 1.453 | 4.09 | 8.4%
+| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
+|---|---|---|---|---|
+| **SGPE(WWHO)** | **7,240,147** | **1.330** | **4.46** | **-** |
+| OpenAI (o200k_base) | 7,420,527 | 1.364 | 4.35 | 2.4% |
+| Llama 4 Scout | 7,512,843 | 1.381 | 4.30 | 3.6% |
+| DeepSeek V3 | 7,904,670 | 1.453 | 4.09 | 8.4% |
 
 *(Note: Because WWHO routes Latin text directly to the native Tiktoken sequence, English performance is mathematically identical. The minor delta in total tokens emerges solely from boundary crossing mechanics.)*
 
 ### 4. Overall (Mixed-Script)
-|Tokenizer | Tokens | TWR | Chr/Tok | % Reduction
-|----------------------------------------------------------------------
-|**SGPE(WWHO) | 27,327,989 | 1.240 | 4.47 | -**
-|OpenAI (o200k_base) | 43,174,798 | 1.959 | 2.83 | 36.7%
-|Llama 4 Scout | 45,236,671 | 2.053 | 2.70 | 39.6%
-|DeepSeek V3 | 68,739,586 | 3.119 | 1.78 | 60.2%
+| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
+|---|---|---|---|---|
+| **SGPE(WWHO)** | **27,327,989** | **1.240** | **4.47** | **-** |
+| OpenAI (o200k_base) | 43,174,798 | 1.959 | 2.83 | 36.7% |
+| Llama 4 Scout | 45,236,671 | 2.053 | 2.70 | 39.6% |
+| DeepSeek V3 | 68,739,586 | 3.119 | 1.78 | 60.2% |
 
 - **Zero-Breakage Guarantee**: Validated through exhaustive testing permutations across all supported Abugida scripts (0 violations).
 - **Full-corpus reconstruction**: 1.5M code-switched sentences encoded and decoded with 0 non-UNK mismatches.
```
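The derived columns in the tables above follow directly from the raw token counts. A minimal sketch of the arithmetic, assuming the standard definitions (TWR as tokens per word, Chr/Tok as characters per token, and % Reduction as SGPE's token savings relative to each baseline — these definitions are inferred, not taken from the WWHO source):

```python
# Assumed definitions for the benchmark table's derived columns.

def twr(tokens: int, words: int) -> float:
    """Token-to-word ratio: average tokens emitted per word."""
    return tokens / words

def chr_per_tok(chars: int, tokens: int) -> float:
    """Average characters covered by each token."""
    return chars / tokens

def pct_reduction(baseline_tokens: int, sgpe_tokens: int) -> float:
    """Percent fewer tokens SGPE emits than a baseline tokenizer."""
    return 100.0 * (1.0 - sgpe_tokens / baseline_tokens)

# Cross-check against the Sinhala row: SGPE vs OpenAI o200k_base.
sgpe_tokens = 6_654_288
o200k_tokens = 17_360_196
print(f"{pct_reduction(o200k_tokens, sgpe_tokens):.1f}%")  # → 61.7%, as in the table
```

The same formula reproduces the other baseline rows, which is consistent with "% Reduction" always being computed against SGPE(WWHO) as the reference.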
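The note under the English table says Latin text is routed straight to the native Tiktoken sequence, so the only English-token delta comes from script-boundary crossings. A hypothetical sketch of that run-splitting step (the code-point ranges are the Unicode Sinhala and Devanagari blocks; the function names and the fallback-to-Latin rule are illustrative assumptions, not WWHO's actual implementation):

```python
# Hypothetical script router: split text into same-script runs so each
# run can be sent to the matching tokenizer (SGPE for Abugida scripts,
# the native Latin BPE for everything else). Every change of script in
# the output is one "boundary crossing".

def script_of(ch: str) -> str:
    cp = ord(ch)
    if 0x0D80 <= cp <= 0x0DFF:      # Unicode Sinhala block
        return "sinhala"
    if 0x0900 <= cp <= 0x097F:      # Unicode Devanagari block
        return "devanagari"
    return "latin"                  # fallback: route to the native BPE

def script_runs(text: str):
    """Yield (script, substring) pairs of maximal same-script runs."""
    if not text:
        return
    start, cur = 0, script_of(text[0])
    for i, ch in enumerate(text[1:], 1):
        s = script_of(ch)
        if s != cur:
            yield cur, text[start:i]
            start, cur = i, s
    yield cur, text[start:]

print(list(script_runs("hello සිංහල world")))
# Three runs → two boundary crossings around the Sinhala segment.
```

Under this model, pure-English input produces a single Latin run, so the Latin-side tokenization is untouched and only mixed-script sentences pay the crossing overhead the note describes.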