# BoKenLm - Tibetan KenLM Language Model
A KenLM n-gram language model trained on Tibetan text, tokenized with BoSentencePiece.
## Model Details
| Parameter | Value |
|---|---|
| Model Type | Modified Kneser-Ney 5-gram |
| Tokenizer | openpecha/BoSentencePiece (Unigram, 20k vocab) |
| Training Corpus | bo_corpus.txt |
| Pruning | 0 0 1 (count-1 n-grams pruned for orders ≥ 3) |
| Tokens | 38,497,401 |
| Vocabulary Size | 20,003 |
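The model scores space-separated BoSentencePiece pieces, so raw Tibetan text must be segmented before scoring. A minimal sketch, assuming the tokenizer's SentencePiece model file can be fetched from the openpecha/BoSentencePiece repo (the filename `bo_sp.model` is illustrative; check the repo for the actual file):

```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Hypothetical filename; look up the actual .model file in the repo
sp_path = hf_hub_download("openpecha/BoSentencePiece", "bo_sp.model")
sp = spm.SentencePieceProcessor(model_file=sp_path)

# Produce the space-joined piece string that the language model expects
pieces = sp.encode("བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།", out_type=str)
tokenized = " ".join(pieces)
```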
## N-gram Statistics

N-gram counts and the estimated modified Kneser-Ney discounts (D1, D2, D3+) for each order:
| Order | Count | D1 | D2 | D3+ |
|---|---|---|---|---|
| 1 | 20,003 | 0.4380 | 0.4927 | 1.5624 |
| 2 | 6,649,657 | 0.6717 | 1.1476 | 1.5422 |
| 3 | 4,299,504 | 0.8465 | 1.2657 | 1.4805 |
| 4 | 3,477,865 | 0.9176 | 1.3860 | 1.5187 |
| 5 | 2,589,246 | 0.8776 | 1.4493 | 1.5850 |
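For reference, the discount columns follow the standard Chen-Goodman closed-form estimate used by lmplz, where $n_k$ is the number of n-grams of a given order that occur exactly $k$ times:

$$
Y = \frac{n_1}{n_1 + 2n_2}, \qquad
D_1 = 1 - 2Y\frac{n_2}{n_1}, \qquad
D_2 = 2 - 3Y\frac{n_3}{n_2}, \qquad
D_{3+} = 3 - 4Y\frac{n_4}{n_3}
$$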
## Memory Estimates

Binary model sizes for KenLM's two query data structures, as estimated at training time:
| Type | MB | Details |
|---|---|---|
| probing | 375 | assuming `-p 1.5` |
| probing | 457 | assuming `-r models -p 1.5` |
| trie | 187 | without quantization |
| trie | 99 | assuming `-q 8 -b 8` quantization |
| trie | 159 | assuming `-a 22` array pointer compression |
| trie | 71 | assuming `-a 22 -q 8 -b 8` array pointer compression and quantization |
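The table mirrors the flags of KenLM's `build_binary` tool. A sketch of the corresponding invocations (output filenames are illustrative):

```bash
# Probing hash table (fastest queries, largest), space multiplier 1.5
build_binary -p 1.5 probing lm.arpa lm_probing.bin

# Trie, no quantization
build_binary trie lm.arpa lm_trie.bin

# Trie with 8-bit quantized probabilities and backoffs
build_binary -q 8 -b 8 trie lm.arpa lm_trie_q8.bin

# Trie with pointer compression plus quantization (smallest, slower)
build_binary -a 22 -q 8 -b 8 trie lm.arpa lm_trie_small.bin
```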
## Training Resources
| Metric | Value |
|---|---|
| Peak Virtual Memory | 12,333 MB |
| Peak RSS | 2,976 MB |
| Wall Time | 36.2s |
| User Time | 41.3s |
| System Time | 17.1s |
## Usage

```python
import kenlm

# Load the ARPA model
model = kenlm.Model("lm.arpa")

# Score a sentence already tokenized with BoSentencePiece
# (returns the total log10 probability, including sentence boundaries)
score = model.score("▁བོད་སྐད་ ▁ཀྱི་ ▁ཚིག་གྲུབ་ ▁འདི་ ▁ཡིན།")
print(score)
```
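Beyond a single score, the kenlm bindings also expose perplexity and per-token breakdowns; a short sketch using the same illustrative sentence:

```python
sent = "▁བོད་སྐད་ ▁ཀྱི་ ▁ཚིག་གྲུབ་ ▁འདི་ ▁ཡིན།"

# Perplexity over the whitespace-separated tokens
print(model.perplexity(sent))

# Per-token (log10 probability, n-gram order matched, is-OOV) tuples
for log10_prob, ngram_len, oov in model.full_scores(sent):
    print(log10_prob, ngram_len, oov)
```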
## Files

- `lm.arpa` - ARPA format language model
- `README.md` - This model card
## License
Apache 2.0