# G2P Multilingual ByT5 (12 layers) - IPA CHILDES - Espeak-ng

This model is a sequence-to-sequence model based on Google's ByT5, fine-tuned on the IPA CHILDES (split) dataset to convert graphemes to phonemes over a context of 512 tokens for 31 languages. The target phonemes were generated by the espeak-ng backend.
ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5. ByT5 was pre-trained only on mC4, with no supervised training, using an average span mask of 20 UTF-8 characters.
## Language tags
The following language tags can be used for prefixing the model input:
| Tag | Language |
|---|---|
| ca-ES | Catalan |
| cy-GB | Welsh |
| da-DK | Danish |
| de-DE | German |
| en-US | English (North America) |
| en-GB | English (United Kingdom) |
| es-ES | Spanish |
| et-EE | Estonian |
| eu-ES | Basque |
| fa-IR | Farsi/Persian |
| fr-FR | French |
| ga-IE | Irish |
| hr-HR | Croatian |
| hu-HU | Hungarian |
| id-ID | Indonesian |
| is-IS | Icelandic |
| it-IT | Italian |
| ja-JP | Japanese |
| ko-KR | Korean |
| nb-NO | Norwegian |
| nl-NL | Dutch |
| pl-PL | Polish |
| pt-BR | Portuguese (Brazil) |
| pt-PT | Portuguese |
| qu-PE | Quechua |
| ro-RO | Romanian |
| sr-RS | Serbian |
| sv-SE | Swedish |
| tr-TR | Turkish |
| yue-CN | Cantonese |
| zh-CN | Chinese |
The tag must be prepended to the prompt as a prefix in the format `<{tag}>: ` (e.g., `<pt-BR>: `).

Note: the space between the prefix colon (`:`) and the beginning of the text is mandatory.
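The prefix format above can be wrapped in a small helper so the mandatory space is never forgotten (`build_prompt` is a hypothetical convenience function, not part of this repository):

```python
def build_prompt(tag: str, text: str) -> str:
    # Prepend the language tag as "<tag>: " -- note the mandatory
    # space after the colon required by the model.
    return f"<{tag}>: {text}"

print(build_prompt("pt-BR", "Olá, mundo!"))
# <pt-BR>: Olá, mundo!
```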
## Example 1: inference with tokenizer
For batched inference and training, it is recommended to use a tokenizer class to handle padding, truncation, and special tokens:
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-mbyt5-12l-ipa-childes-espeak')
tokenizer = AutoTokenizer.from_pretrained('fdemelo/g2p-mbyt5-12l-ipa-childes-espeak')

model_inputs = tokenizer(
    ["<en-US>: Life is like a box of chocolates."],
    max_length=512,
    padding=True,
    truncation=True,
    add_special_tokens=False,
    return_tensors="pt",
)
# We do not find beam search helpful; greedy decoding is enough.
preds = model.generate(**model_inputs, num_beams=1, max_length=512)
phones = tokenizer.batch_decode(preds.tolist(), skip_special_tokens=True)
print(phones)
# ['laɪf ɪz laɪk ɐ bɑːks ʌv tʃɑːkləts']
```
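Since the model's context is 512 tokens and ByT5 tokens are UTF-8 bytes, long inputs must be split before inference. A minimal sketch of word-boundary chunking under a byte budget (`chunk_text` is a hypothetical helper, not shipped with the model):

```python
def chunk_text(text: str, prefix: str, budget: int = 512) -> list[str]:
    # ByT5 operates on UTF-8 bytes, so the budget is measured in bytes,
    # and the language prefix prepended to every chunk counts against it.
    limit = budget - len(prefix.encode("utf-8"))
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate.encode("utf-8")) > limit and current:
            chunks.append(prefix + current)  # flush the full chunk
            current = word
        else:
            current = candidate
    if current:
        chunks.append(prefix + current)
    return chunks

prompts = chunk_text("Life is like a box of chocolates.", "<en-US>: ", budget=30)
```

Each resulting prompt stays within the byte budget and can be passed to the tokenizer as a batch.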
## Example 2: inference without tokenizer
For standalone inference, decoding without the tokenizer reads as:
```python
import torch
import json
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-mbyt5-12l-ipa-childes-espeak')

# Add a shift of 3 to the raw UTF-8 bytes to account for the
# special tokens <pad>, </s>, <unk>
input_ids = torch.tensor([list("<en-US>: Life is like a box of chocolates.".encode("utf-8"))]) + 3
preds = model.generate(input_ids=input_ids, num_beams=1, max_length=512)

# Simplified version of the decoding process (discarding special/added tokens)
with open("tokenizer_config.json", "r") as f:
    added_tokens = json.load(f).get("added_tokens_decoder", {})
phone_bytes = [
    bytes([token - 3]) for token in preds[0].tolist() if str(token) not in added_tokens
]
phones = b''.join(phone_bytes).decode("utf-8", errors="ignore")
print(phones)
# 'laɪf ɪz laɪk ɐ bɑːks ʌv tʃɑːkləts'
```
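The byte/id mapping used above is symmetric, which can be checked without loading the model. A minimal sketch of the round trip (these helper names are assumptions for illustration; only the offset of 3 for `<pad>`, `</s>`, `<unk>` comes from the example above):

```python
SPECIAL_OFFSET = 3  # ByT5 reserves ids 0-2 for <pad>, </s>, <unk>

def encode_bytes(text: str) -> list[int]:
    # UTF-8 bytes shifted past the three special tokens.
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")]

def decode_bytes(ids: list[int]) -> str:
    # Inverse of encode_bytes; ids below the offset are special
    # tokens and are skipped.
    return bytes(
        i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET
    ).decode("utf-8", errors="ignore")

ids = encode_bytes("<en-US>: hi")
assert decode_bytes(ids) == "<en-US>: hi"
```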