# G2P Multilingual ByT5 (12 layers) - IPA CHILDES - Espeak-ng

This model is a sequence-to-sequence model based on Google's ByT5, fine-tuned on the IPA CHILDES (split) dataset to convert graphemes to phonemes over a context of 512 tokens for 31 languages. The target phonemes were generated by the espeak-ng backend.
ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5. ByT5 was pre-trained only on mC4, with no supervised training, using an average span mask of 20 UTF-8 characters.
## Language tags
The following language tags can be used for prefixing the model input:
| Tag | Language |
|---|---|
| ca-ES | Catalan |
| cy-GB | Welsh |
| da-DK | Danish |
| de-DE | German |
| en-US | English (North America) |
| en-GB | English (United Kingdom) |
| es-ES | Spanish |
| et-EE | Estonian |
| eu-ES | Basque |
| fa-IR | Farsi/Persian |
| fr-FR | French |
| ga-IE | Irish |
| hr-HR | Croatian |
| hu-HU | Hungarian |
| id-ID | Indonesian |
| is-IS | Icelandic |
| it-IT | Italian |
| ja-JP | Japanese |
| ko-KR | Korean |
| nb-NO | Norwegian |
| nl-NL | Dutch |
| pl-PL | Polish |
| pt-BR | Portuguese (Brazil) |
| pt-PT | Portuguese |
| qu-PE | Quechua |
| ro-RO | Romanian |
| sr-RS | Serbian |
| sv-SE | Swedish |
| tr-TR | Turkish |
| yue-CN | Cantonese |
| zh-CN | Chinese |
The tag must be prepended to the prompt as a prefix in the format `<{tag}>: ` (e.g., `<pt-BR>: `).

Note: the space between the prefix colon (`:`) and the beginning of the text is mandatory.
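The prefix format above can be wrapped in a small helper so the mandatory space is never forgotten (`build_prompt` is a hypothetical convenience function, not part of this repository):

```python
def build_prompt(tag: str, text: str) -> str:
    # Prepend the language tag as "<tag>: " -- note the mandatory
    # space after the colon required by the model.
    return f"<{tag}>: {text}"

print(build_prompt("pt-BR", "Olá, mundo!"))
# <pt-BR>: Olá, mundo!
```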
## Example 1: inference with tokenizer
For batched inference and training, it is recommended to use a tokenizer class to handle padding, truncation, and special tokens:
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-mbyt5-12l-ipa-childes-espeak')
tokenizer = AutoTokenizer.from_pretrained('fdemelo/g2p-mbyt5-12l-ipa-childes-espeak')

model_inputs = tokenizer(
    ["<en-US>: Life is like a box of chocolates."],
    max_length=512,
    padding=True,
    truncation=True,
    add_special_tokens=False,
    return_tensors="pt",
)
# We do not find beam search helpful; greedy decoding is enough.
preds = model.generate(**model_inputs, num_beams=1, max_length=512)
phones = tokenizer.batch_decode(preds.tolist(), skip_special_tokens=True)
print(phones)
# ['laɪf ɪz laɪk ɐ bɑːks ʌv tʃɑːkləts']
```
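Since the model's context is 512 tokens and ByT5 tokens are UTF-8 bytes, long inputs must be split before inference. A minimal sketch of word-boundary chunking under a byte budget (`chunk_text` is a hypothetical helper, not shipped with the model):

```python
def chunk_text(text: str, prefix: str, budget: int = 512) -> list[str]:
    # ByT5 operates on UTF-8 bytes, so the budget is measured in bytes,
    # and the language prefix prepended to every chunk counts against it.
    limit = budget - len(prefix.encode("utf-8"))
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate.encode("utf-8")) > limit and current:
            chunks.append(prefix + current)  # flush the full chunk
            current = word
        else:
            current = candidate
    if current:
        chunks.append(prefix + current)
    return chunks

prompts = chunk_text("Life is like a box of chocolates.", "<en-US>: ", budget=30)
```

Each resulting prompt stays within the byte budget and can be passed to the tokenizer as a batch.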
## Example 2: inference without tokenizer
For standalone inference, decoding without the tokenizer reads as:
```python
import torch
import json
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-mbyt5-12l-ipa-childes-espeak')

# Add a shift of 3 to the raw UTF-8 bytes to account for the
# special tokens <pad>, </s>, <unk>
input_ids = torch.tensor([list("<en-US>: Life is like a box of chocolates.".encode("utf-8"))]) + 3
preds = model.generate(input_ids=input_ids, num_beams=1, max_length=512)

# Simplified version of the decoding process (discarding special/added tokens)
with open("tokenizer_config.json", "r") as f:
    added_tokens = json.load(f).get("added_tokens_decoder", {})
phone_bytes = [
    bytes([token - 3]) for token in preds[0].tolist() if str(token) not in added_tokens
]
phones = b''.join(phone_bytes).decode("utf-8", errors="ignore")
print(phones)
# 'laɪf ɪz laɪk ɐ bɑːks ʌv tʃɑːkləts'
```
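The byte/id mapping used above is symmetric, which can be checked without loading the model. A minimal sketch of the round trip (these helper names are assumptions for illustration; only the offset of 3 for `<pad>`, `</s>`, `<unk>` comes from the example above):

```python
SPECIAL_OFFSET = 3  # ByT5 reserves ids 0-2 for <pad>, </s>, <unk>

def encode_bytes(text: str) -> list[int]:
    # UTF-8 bytes shifted past the three special tokens.
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")]

def decode_bytes(ids: list[int]) -> str:
    # Inverse of encode_bytes; ids below the offset are special
    # tokens and are skipped.
    return bytes(
        i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET
    ).decode("utf-8", errors="ignore")

ids = encode_bytes("<en-US>: hi")
assert decode_bytes(ids) == "<en-US>: hi"
```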