A BPE-based tokenizer used for the MEHDIE project and for training a bilingual BERT model.
Vocabulary size: 52000

Trained on:
- Arabic dataset: https://huggingface.co/datasets/bigscience-data/roots_ar_openiti_proc
- Hebrew/English dataset: https://huggingface.co/datasets/mehdie/sefaria
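
For readers unfamiliar with BPE, the following is a minimal sketch of how merge rules are learned, using a toy word list. It is not the training code used for this tokenizer: the function name `learn_bpe_merges`, the toy corpus, and the lexicographic tie-breaking rule are all assumptions made for illustration, and production tokenizers typically add pre-tokenization and normalization steps that are omitted here.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy sketch)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Hypothetical toy corpus, unrelated to the datasets listed above.
print(learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2))
```

Training repeats this merge step until the vocabulary reaches the target size, which for this tokenizer is 52000 entries.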
Examples:

Hebrew:
- "ืื ืืกืคืจ ืืืืืจ ืืืืจืื ืฉืกืคืจ ืืืฉ ืืื ืืืจืฅ ื ืืืจื ืฉืฉืื ืจืื ืื ืืืื ืืจ ืืื ื ืืืืืืื. ืืืื ืืืื ืืืื ืืืจืฆืืช ืจืืืช ืืจืืืงืืช ืืืฉืจ ืืชืคืจืฉ ืืืืจืื ืืื ืืืื ืืงืื ืฉืื ืื ืืชื ืื ืืืืจืื ืฉืจืื ืื ืฉืฉืืข ืืคื ืื ืฉื ืืืช ืืฉืจ ื ืฉืืขื ืืืจืฅ ืกืคืจื: ืืื ืืื ืืืืจ ืืงืฆืช ืืืืืืื ืืื ืฉืืืื ืฉืืืงืฆืช ืืงืืืืช ืืืฉืื ืืืื ืืืจืื ืืื ืขืื ืืืจืฅ ืงืฉืืืืื ืืฉื ืช ืชืชืงื"
- {'input_ids': [1060, 15784, 20958, 31767, 476, 4398, 3294, 1812, 19949, 42648, 455, 38010, 2069, 23008, 978, 11894, 3509, 8222, 973, 26, 23816, 8043, 461, 19170, 2998, 6517, 4245, 960, 5536, 928, 4122, 1008, 2643, 16456, 2702, 10350, 1796, 3044, 1333, 1488, 1019, 5501, 15530, 1109, 26822, 8473, 11437, 5419, 1919, 467, 13163, 6566, 4398, 454, 38, 7922, 1203, 41248, 9907, 21722, 1001, 16464, 931, 1123, 9907, 9647, 1053, 3044, 4553, 3573, 2851, 4088, 9330, 3492, 18352, 1057, 23994, 32635, 463], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Arabic:
- "ุณูุณูุฉ ุงูุฃุฌุฒุงุก ูุงููุชุจ ุงูุญุฏูุซูุฉ ุงูููุงุฆุฏ ูุงูุฃุฎุจุงุฑ ูุงูุญูุงูุงุช ุนู ุงูุดุงูุนู ูุญุงุชู ุงูุฃุตู ูู ุนุฑูู ุงููุฑุฎู ูุบูุฑูู ููู ุญุฏุซ ุงููููู ุฃุจู ุนูู ุงูุญุณู ุจู ุงูุญุณูู ุจู ุญู ูุงู ุงููู ุฐุงูู ุงูุดุงูุนู ุฏุฑุงุณุฉ ูุชุญููู ูุชุนููู ุงูุทุจุนุฉ ุงูุฃููู ุงูุฌุฒุก ุงูุฃูู ู ู ุงูููุงุฆุฏ ูุงูุฃุฎุจุงุฑ ูุงูุญูุงูุงุช ุนู ุงูุดุงูุนู ูุญุงุชู ุงูุฃุตู ูู ุนุฑูู ุงููุฑุฎู ูุบูุฑูู ุฑุถู ุงููู ุนููู ุฑูุงูุฉ"
- {'input_ids': [27193, 15595, 34780, 1361, 949, 13852, 21459, 2169, 30440, 896, 2040, 41252, 9723, 50442, 16317, 3057, 1675, 1216, 3320, 958, 910, 1260, 888, 1532, 888, 912, 935, 13333, 2040, 36093, 22637, 49937, 16554, 2254, 4572, 1576, 890, 13852, 21459, 2169, 30440, 896, 2040, 41252, 9723, 50442, 16317, 3057, 1432, 904, 2710, 1933], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
English:
- "The medieval Arabic name of the northernmost of the three provinces of the Jazira, the other two being Diyar Mudar and Diyar Rabi'a"
- {'input_ids': [2034, 16522, 4490, 1270, 22040, 1837, 2340, 7960, 1183, 989, 10048, 2068, 90, 13377, 1183, 989, 8235, 14261, 1021, 7322, 1183, 989, 54, 18017, 17311, 24, 989, 3249, 5269, 8500, 48, 17821, 1294, 57, 3307, 1294, 1261, 48, 17821, 1294, 26438, 85, 19, 77], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
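
Each encoding above has three parallel lists. As a hedged sketch, the snippet below rebuilds the English encoding and checks two properties one would expect to hold: the three lists have equal length, and every id falls below the stated vocabulary size of 52000. The helper `check_encoding` is a hypothetical name introduced here, not part of any library.

```python
VOCAB_SIZE = 52000  # vocabulary size stated in this card


def check_encoding(encoding, vocab_size=VOCAB_SIZE):
    """Return True if the three lists align and all ids are valid vocabulary indices."""
    ids = encoding["input_ids"]
    n = len(ids)
    return (
        len(encoding["token_type_ids"]) == n
        and len(encoding["attention_mask"]) == n
        and all(0 <= token_id < vocab_size for token_id in ids)
    )


# input_ids copied from the English example above.
english_ids = [
    2034, 16522, 4490, 1270, 22040, 1837, 2340, 7960, 1183, 989, 10048,
    2068, 90, 13377, 1183, 989, 8235, 14261, 1021, 7322, 1183, 989, 54,
    18017, 17311, 24, 989, 3249, 5269, 8500, 48, 17821, 1294, 57, 3307,
    1294, 1261, 48, 17821, 1294, 26438, 85, 19, 77,
]
english = {
    "input_ids": english_ids,
    # Single-segment input: every token belongs to segment 0.
    "token_type_ids": [0] * len(english_ids),
    # No padding: every position is attended to.
    "attention_mask": [1] * len(english_ids),
}

print(check_encoding(english))
```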