We evaluated WWHO against frontier models across a 1.5-million-sentence code-switched corpus containing Sinhala, Hindi (Devanagari), and English.

### 1. Sinhala Efficiency

| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **6,654,288** | **1.274** | **4.83** | **-** |
| OpenAI (o200k_base) | 17,360,196 | 3.324 | 1.85 | 61.7% |
| Llama 4 Scout | 18,157,707 | 3.476 | 1.77 | 63.4% |
| DeepSeek V3 | 29,152,698 | 5.581 | 1.10 | 77.2% |
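The derived columns in these tables can be reproduced from the raw token counts. Below is a minimal sketch, assuming TWR means tokens per word, Chr/Tok means characters per token, and % Reduction compares SGPE(WWHO)'s token count against each baseline's — formulas inferred from the numbers above, not taken from the project's evaluation code:

```python
# Hypothetical reconstruction of the table's derived metrics.
# The formulas are assumptions inferred from the published numbers.

def tokens_per_word(tokens: int, words: int) -> float:
    """TWR: average tokens emitted per whitespace word (lower is better)."""
    return tokens / words

def chars_per_token(chars: int, tokens: int) -> float:
    """Chr/Tok: characters packed into each token (higher is better)."""
    return chars / tokens

def pct_reduction(ours: int, theirs: int) -> float:
    """% Reduction: token savings of `ours` relative to `theirs`."""
    return (1 - ours / theirs) * 100

# Sinhala row: SGPE(WWHO) vs OpenAI o200k_base
print(round(pct_reduction(6_654_288, 17_360_196), 1))  # 61.7
```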

### 2. Hindi (Devanagari) Efficiency

| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **13,433,554** | **1.181** | **4.29** | **-** |
| OpenAI (o200k_base) | 18,394,075 | 1.617 | 3.13 | 27.0% |
| Llama 4 Scout | 19,566,121 | 1.720 | 2.94 | 31.3% |
| DeepSeek V3 | 31,682,218 | 2.786 | 1.82 | 57.6% |

### 3. English

| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **7,240,147** | **1.330** | **4.46** | **-** |
| OpenAI (o200k_base) | 7,420,527 | 1.364 | 4.35 | 2.4% |
| Llama 4 Scout | 7,512,843 | 1.381 | 4.30 | 3.6% |
| DeepSeek V3 | 7,904,670 | 1.453 | 4.09 | 8.4% |

*(Note: Because WWHO routes Latin-script text directly to the native Tiktoken sequence, its English tokenization is identical to o200k_base within pure-Latin spans; the small difference in total tokens arises solely from script-boundary crossings.)*
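The routing described in the note can be illustrated with a toy script splitter. This is a simplification — it uses a plain ASCII check as a stand-in for Latin-script detection, and `script_runs` is a hypothetical helper, not WWHO's actual routing code:

```python
# Hypothetical sketch: split text into contiguous same-script runs.
# Latin (here: ASCII) runs would pass through to the base BPE unchanged;
# other runs would be handed to the SGPE path.

def script_runs(text: str) -> list[tuple[bool, str]]:
    """Return (is_latin, span) pairs covering `text` in order."""
    runs: list[tuple[bool, str]] = []
    for ch in text:
        latin = ord(ch) < 0x80  # ASCII approximation of "Latin script"
        if runs and runs[-1][0] == latin:
            runs[-1] = (latin, runs[-1][1] + ch)  # extend the current run
        else:
            runs.append((latin, ch))              # start a new run
    return runs

print(script_runs("hello මෙය world"))
# → [(True, 'hello '), (False, 'මෙය'), (True, ' world')]
```

On pure-Latin input this produces a single run, which is why English tokenization is unchanged; each script switch adds one boundary, which is where the small token delta comes from.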

### 4. Overall (Mixed-Script)

| Tokenizer | Tokens | TWR | Chr/Tok | % Reduction |
|---|---|---|---|---|
| **SGPE(WWHO)** | **27,327,989** | **1.240** | **4.47** | **-** |
| OpenAI (o200k_base) | 43,174,798 | 1.959 | 2.83 | 36.7% |
| Llama 4 Scout | 45,236,671 | 2.053 | 2.70 | 39.6% |
| DeepSeek V3 | 68,739,586 | 3.119 | 1.78 | 60.2% |

- **Zero-Breakage Guarantee**: Validated through exhaustive permutation testing across all supported abugida scripts (0 violations).
- **Full-corpus reconstruction**: All 1.5M code-switched sentences encoded and decoded with 0 non-UNK mismatches.
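The full-corpus reconstruction check can be sketched as a simple round-trip harness. `wwho_encode` and `wwho_decode` are hypothetical stand-ins for the tokenizer's real API (the README section doesn't show it); a trivial reversible byte-level codec plays their role here so the harness runs:

```python
# Hypothetical round-trip harness for the reconstruction claim above.
# Swap in the real tokenizer's encode/decode to run the actual check.

def wwho_encode(text: str) -> list[int]:
    """Stand-in encoder: UTF-8 bytes as token ids (trivially reversible)."""
    return list(text.encode("utf-8"))

def wwho_decode(ids: list[int]) -> str:
    """Stand-in decoder: invert wwho_encode exactly."""
    return bytes(ids).decode("utf-8")

def roundtrip_mismatches(corpus: list[str]) -> int:
    """Count sentences that fail exact encode -> decode reconstruction."""
    return sum(1 for s in corpus if wwho_decode(wwho_encode(s)) != s)

corpus = ["මෙය පරීක්ෂණයකි and a test", "यह एक परीक्षण है", "plain English"]
print(roundtrip_mismatches(corpus))  # 0
```

A real run would stream all 1.5M sentences through the same loop and additionally exclude UNK-containing outputs, matching the "non-UNK mismatches" framing above.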