G2P Multilingual ByT5 (12 layers) - IPA CHILDES - Espeak-ng

This model is a sequence-to-sequence model based on Google's ByT5, fine-tuned on the IPA CHILDES (split) dataset to convert graphemes to phonemes over a context of 512 tokens for 31 languages. The target phonemes were generated with the espeak-ng backend.

ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5.

ByT5 was pre-trained only on mC4, without any supervised training, using an average span mask of 20 UTF-8 bytes.

Language tags

The following language tags can be used for prefixing the model input:

Tag Language
ca-ES Catalan
cy-GB Welsh
da-DK Danish
de-DE German
en-US English (North America)
en-GB English (United Kingdom)
es-ES Spanish
et-EE Estonian
eu-ES Basque
fa-IR Farsi/Persian
fr-FR French
ga-IE Irish
hr-HR Croatian
hu-HU Hungarian
id-ID Indonesian
is-IS Icelandic
it-IT Italian
ja-JP Japanese
ko-KR Korean
nb-NO Norwegian
nl-NL Dutch
pl-PL Polish
pt-BR Portuguese (Brazil)
pt-PT Portuguese
qu-PE Quechua
ro-RO Romanian
sr-RS Serbian
sv-SE Swedish
tr-TR Turkish
yue-CN Cantonese
zh-CN Chinese

The tag must be prepended to the prompt as a prefix using the format <{tag}>: (e.g., <pt-BR>: ). Note: a space between the prefix colon (:) and the beginning of the text is mandatory.
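A minimal sketch of building prefixed prompts from the tags above; the `prefix` helper name is hypothetical, not part of the model's API:

```python
def prefix(tag: str, text: str) -> str:
    # The space after the colon is mandatory.
    return f"<{tag}>: {text}"

prompts = [
    prefix("en-US", "Life is like a box of chocolates."),
    prefix("pt-BR", "A vida é como uma caixa de chocolates."),
]
print(prompts[0])
# '<en-US>: Life is like a box of chocolates.'
```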

Example 1: inference with tokenizer

For batched inference and training, it is recommended to use a tokenizer class to handle padding, truncation, and additional tokens:

from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-mbyt5-12l-ipa-childes-espeak')
tokenizer = AutoTokenizer.from_pretrained('fdemelo/g2p-mbyt5-12l-ipa-childes-espeak')

model_inputs = tokenizer(["<en-US>: Life is like a box of chocolates."], max_length=512, padding=True, truncation=True, add_special_tokens=False, return_tensors="pt")
preds = model.generate(**model_inputs, num_beams=1, max_length=512) # We do not find beam search helpful. Greedy decoding is enough. 
phones = tokenizer.batch_decode(preds.tolist(), skip_special_tokens=True)
print(phones)
# ['laɪf ɪz laɪk ɐ bɑːks ʌv tʃɑːkləts']

Example 2: inference without tokenizer

For standalone inference, decoding without the tokenizer can be done as follows:

import torch
import json
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-mbyt5-12l-ipa-childes-espeak')
input_ids = torch.tensor([list("<en-US>: Life is like a box of chocolates.".encode("utf-8"))]) + 3  # add shift to account for special tokens <pad>, </s>, <unk>
preds = model.generate(input_ids=input_ids, num_beams=1, max_length=512)
# Simplified version of the decoding process (discarding special/added tokens)
with open("tokenizer_config.json", "r") as f:
    added_tokens = json.load(f).get("added_tokens_decoder", {})
phone_bytes = [
    bytes([token - 3]) for token in preds[0].tolist() if str(token) not in added_tokens
]
phones = b''.join(phone_bytes).decode("utf-8", errors="ignore")
print(phones)
# 'laɪf ɪz laɪk ɐ bɑːks ʌv tʃɑːkləts'
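A model-free illustration of the +3 offset used above (an assumption-free round trip, no model download needed): ByT5 reserves ids 0-2 for the special tokens <pad>, </s>, and <unk>, so each raw UTF-8 byte maps to token id byte + 3.

```python
# Encode text to token ids with the +3 shift, then invert the shift to decode.
text = "chocolates"
ids = [b + 3 for b in text.encode("utf-8")]   # byte values shifted past the 3 special tokens
decoded = bytes(i - 3 for i in ids).decode("utf-8")
assert decoded == text
print(ids[:3])
```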