Transformers.js documentation

You are viewing the main version, which requires installation from source. For a regular npm install, check out the latest stable version (v3.8.1).

tokenizers

Tokenization utilities

tokenizers
- static
  - .PreTrainedTokenizer
    - new PreTrainedTokenizer(tokenizerJSON, tokenizerConfig)
    - instance
      - .convert_tokens_to_ids(tokens) ⇒ any
      - ._call(text, options) ⇒ BatchEncoding
      - ._encode_text(text) ⇒ Array | null
      - .tokenize(text, options) ⇒ Array
      - .encode(text, options) ⇒ Array
      - .batch_decode(batch, decode_args) ⇒ Array
      - .decode(token_ids, [decode_args]) ⇒ string
      - .decode_single(token_ids, decode_args) ⇒ string
      - .get_chat_template(options) ⇒ string
      - .apply_chat_template(conversation, options) ⇒ string | Tensor | Array | Array | BatchEncoding
    - static
      - .from_pretrained(pretrained_model_name_or_path, options) ⇒ Promise.<PreTrainedTokenizer>
  - .loadTokenizer(pretrained_model_name_or_path, options) ⇒ Promise.<Array>
  - .prepareTensorForDecode(tensor) ⇒ Array
  - ._build_translation_inputs(self, raw_inputs, tokenizer_options, generate_kwargs) ⇒ Object
- inner
  - ~PretrainedTokenizerOptions : PretrainedOptions
  - ~Message : Object
  - ~BatchEncoding : Array | Array | Tensor

tokenizers.PreTrainedTokenizer

Kind: static class of tokenizers

new PreTrainedTokenizer(tokenizerJSON, tokenizerConfig)

Create a new PreTrainedTokenizer instance.

Param	Type	Description
tokenizerJSON	`Object`	The JSON of the tokenizer.
tokenizerConfig	`Object`	The config of the tokenizer.

preTrainedTokenizer.convert_tokens_to_ids(tokens) ⇒ any

Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.

Kind: instance method of PreTrainedTokenizer
Returns: any - The token id or list of token ids.

Param	Type	Description
tokens	`T`	One or several token(s) to convert to token id(s).
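
The lookup described above can be pictured as a vocabulary map that accepts either one token or an array of tokens. The following is a minimal stand-alone sketch, not the library's code: the `vocab` map, the `unkId` fallback for unknown tokens, and the `convertTokensToIds` function are all hypothetical names chosen for illustration.

```javascript
// Hypothetical toy vocabulary; a real tokenizer reads its vocabulary from tokenizer.json.
const vocab = new Map([["[UNK]", 0], ["hello", 1], ["world", 2]]);
const unkId = vocab.get("[UNK]");

// Assumption: tokens missing from the vocabulary map to the unknown-token id.
function convertTokensToIds(tokens) {
  const lookup = (token) => vocab.get(token) ?? unkId;
  // A single token yields a single id; an array yields an array of ids.
  return Array.isArray(tokens) ? tokens.map(lookup) : lookup(tokens);
}

console.log(convertTokensToIds("hello"));                   // → 1
console.log(convertTokensToIds(["hello", "world", "foo"])); // → [1, 2, 0]
```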

preTrainedTokenizer._call(text, options) ⇒ BatchEncoding

Encode/tokenize the given text(s).

Kind: instance method of PreTrainedTokenizer
Returns: BatchEncoding - Object to be passed to the model.

Param	Type	Default	Description
text	`string` \| `Array`		The text to tokenize.
options	`Object`		An optional object containing the following properties:
[options.text_pair]	`string` \| `Array`	`null`	Optional second sequence to be encoded. If set, must be the same type as text.
[options.padding]	`boolean` \| `'max_length'`	`false`	Whether to pad the input sequences.
[options.add_special_tokens]	`boolean`	`true`	Whether or not to add the special tokens associated with the corresponding model.
[options.truncation]	`boolean`		Whether to truncate the input sequences.
[options.max_length]	`number`		Maximum length of the returned list and optionally padding length.
[options.return_tensor]	`boolean`	`true`	Whether to return the results as Tensors or arrays.
[options.return_token_type_ids]	`boolean`		Whether to return the token type ids.
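
To make the interaction of `padding`, `truncation`, and `max_length` concrete, here is a small stand-alone sketch that operates on already-encoded id arrays. It is not the library's implementation. The helper name `padAndTruncate`, the `padId` value, right-side padding, and the reading of `padding: true` as "pad to the longest sequence in the batch" are assumptions made for illustration.

```javascript
// Hypothetical helper: pads on the right with `padId` (an assumption; some tokenizers pad on the left).
function padAndTruncate(batch, { padding = false, truncation = false, max_length = null, padId = 0 } = {}) {
  // Truncate first, when requested and a max_length is given.
  const seqs = batch.map((ids) =>
    truncation && max_length !== null ? ids.slice(0, max_length) : ids.slice()
  );

  // Assumption: `true` pads to the longest sequence, 'max_length' pads to max_length.
  let target = null;
  if (padding === "max_length" && max_length !== null) target = max_length;
  else if (padding === true) target = Math.max(...seqs.map((s) => s.length));

  const input_ids = [];
  const attention_mask = [];
  for (const ids of seqs) {
    const padCount = target === null ? 0 : Math.max(0, target - ids.length);
    input_ids.push([...ids, ...Array(padCount).fill(padId)]);
    // 1 marks a real token, 0 marks padding.
    attention_mask.push([...Array(ids.length).fill(1), ...Array(padCount).fill(0)]);
  }
  return { input_ids, attention_mask };
}

const enc = padAndTruncate([[5, 6, 7, 8], [9]], { padding: true, truncation: true, max_length: 3 });
console.log(enc.input_ids);      // → [[5, 6, 7], [9, 0, 0]]
console.log(enc.attention_mask); // → [[1, 1, 1], [1, 0, 0]]
```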

preTrainedTokenizer._encode_text(text) ⇒ Array | null

Encodes a single text using the preprocessor pipeline of the tokenizer.

Kind: instance method of PreTrainedTokenizer
Returns: Array | null - The encoded tokens.

Param	Type	Description
text	`string` \| `null`	The text to encode.

preTrainedTokenizer.tokenize(text, options) ⇒ Array

Converts a string into a sequence of tokens.

Kind: instance method of PreTrainedTokenizer
Returns: Array - The list of tokens.

Param	Type	Default	Description
text	`string`		The sequence to be encoded.
options	`Object`		An optional object containing the following properties:
[options.pair]	`string`		A second sequence to be encoded with the first.
[options.add_special_tokens]	`boolean`	`false`	Whether or not to add the special tokens associated with the corresponding model.

preTrainedTokenizer.encode(text, options) ⇒ Array

Encodes a single text or a pair of texts using the model’s tokenizer.

Kind: instance method of PreTrainedTokenizer
Returns: Array - An array of token IDs representing the encoded text(s).

Param	Type	Default	Description
text	`string`		The text to encode.
options	`Object`		An optional object containing the following properties:
[options.text_pair]	`string`	`null`	The optional second text to encode.
[options.add_special_tokens]	`boolean`	`true`	Whether or not to add the special tokens associated with the corresponding model.
[options.return_token_type_ids]	`boolean`		Whether to return token_type_ids.
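
How `text_pair`, `add_special_tokens`, and token type ids relate depends on the model's post-processor. As one illustration, the sketch below assumes a BERT-style layout (`[CLS] A [SEP] B [SEP]`). The `encodePair` function and the `CLS` and `SEP` ids are hypothetical, and other models arrange special tokens differently.

```javascript
// Hypothetical special-token ids; a BERT-style layout is an assumption, other models differ.
const CLS = 101;
const SEP = 102;

function encodePair(idsA, idsB = null, { add_special_tokens = true } = {}) {
  const first = add_special_tokens ? [CLS, ...idsA, SEP] : [...idsA];
  const second = idsB === null ? [] : add_special_tokens ? [...idsB, SEP] : [...idsB];
  return {
    input_ids: [...first, ...second],
    // Segment 0 marks the first text, segment 1 marks the second.
    token_type_ids: [...first.map(() => 0), ...second.map(() => 1)],
  };
}

const pair = encodePair([7, 8], [9]);
console.log(pair.input_ids);      // → [101, 7, 8, 102, 9, 102]
console.log(pair.token_type_ids); // → [0, 0, 0, 0, 1, 1]
```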

preTrainedTokenizer.batch_decode(batch, decode_args) ⇒ Array

Decode a batch of tokenized sequences.

Kind: instance method of PreTrainedTokenizer
Returns: Array - List of decoded sequences.

Param	Type	Description
batch	`Array` \| `Tensor`	List/Tensor of tokenized input sequences.
decode_args	`Object`	(Optional) Object with decoding arguments.
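
Conceptually, decoding a batch means decoding each row with the same arguments. The sketch below shows that relationship with a toy decoder. The `idToToken` table, the `specialIds` set, and the `decodeToy` and `batchDecodeToy` functions are hypothetical stand-ins, not part of the library.

```javascript
// Hypothetical id-to-token table standing in for a real vocabulary.
const idToToken = ["<pad>", "hello", "world", "!"];
const specialIds = new Set([0]);

// Toy decoder: drops special tokens on request and joins the rest with spaces.
function decodeToy(tokenIds, { skip_special_tokens = false } = {}) {
  return tokenIds
    .filter((id) => !(skip_special_tokens && specialIds.has(id)))
    .map((id) => idToToken[id])
    .join(" ");
}

// Decoding a batch applies the single-sequence decoder to every row.
function batchDecodeToy(batch, decodeArgs = {}) {
  return batch.map((row) => decodeToy(row, decodeArgs));
}

console.log(batchDecodeToy([[1, 2], [1, 0, 0]], { skip_special_tokens: true }));
// → ["hello world", "hello"]
```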

preTrainedTokenizer.decode(token_ids, [decode_args]) ⇒ string

Decodes a sequence of token IDs back to a string.

Kind: instance method of PreTrainedTokenizer
Returns: string - The decoded string.
Throws: Error - If `token_ids` is not a non-empty array of integers.

Param	Type	Default	Description
token_ids	`Array` \| `Array` \| `Tensor`		List/Tensor of token IDs to decode.
[decode_args]	`Object`	`{}`	Optional decoding arguments.
[decode_args.skip_special_tokens]	`boolean`	`false`	If true, special tokens are removed from the output string.
[decode_args.clean_up_tokenization_spaces]	`boolean`	`true`	If true, spaces before punctuation marks and abbreviated forms are removed.
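
The effect of `clean_up_tokenization_spaces` can be illustrated with a pair of regular expressions. This is only a sketch of the behavior described in the table, under the assumption that "abbreviated forms" means English contractions such as `n't` and `'s`. The function name `cleanUpSpaces` is hypothetical.

```javascript
// Hypothetical helper illustrating space clean-up after decoding.
function cleanUpSpaces(text) {
  return text
    // Remove the space before common punctuation marks.
    .replace(/ ([.?!,])/g, "$1")
    // Re-attach English contractions to the preceding word.
    .replace(/ (n't|'m|'s|'ve|'re)/g, "$1");
}

console.log(cleanUpSpaces("I do n't know , is n't it ?"));
// → "I don't know, isn't it?"
```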

preTrainedTokenizer.decode_single(token_ids, decode_args) ⇒ string

Decode a single list of token ids to a string.

Kind: instance method of PreTrainedTokenizer
Returns: string - The decoded string.

Param	Type	Default	Description
token_ids	`Array` \| `Array`		List of token ids to decode.
decode_args	`Object`		Optional arguments for decoding.
[decode_args.skip_special_tokens]	`boolean`	`false`	Whether to skip special tokens during decoding.
[decode_args.clean_up_tokenization_spaces]	`boolean`		Whether to clean up tokenization spaces during decoding. If null, the value is set to `this.decoder.cleanup` if it exists, falling back to `this.clean_up_tokenization_spaces` if it exists, falling back to `true`.
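
The fallback order described for `clean_up_tokenization_spaces` maps naturally onto nullish coalescing. The sketch below models it with a plain object in place of a tokenizer. The `resolveCleanUp` function is hypothetical, and only the fields named in the description are modeled.

```javascript
// Hypothetical resolver: the first defined value wins, in the order given in the description.
function resolveCleanUp(tokenizer, decodeArgs = {}) {
  return (
    decodeArgs.clean_up_tokenization_spaces ?? // explicit argument
    tokenizer.decoder?.cleanup ??              // decoder-level setting, if present
    tokenizer.clean_up_tokenization_spaces ??  // tokenizer-level setting, if present
    true                                       // final default
  );
}

console.log(resolveCleanUp({}));                                // → true
console.log(resolveCleanUp({ decoder: { cleanup: false } }));   // → false
console.log(resolveCleanUp({ decoder: { cleanup: false } }, { clean_up_tokenization_spaces: true })); // → true
```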

preTrainedTokenizer.get_chat_template(options) ⇒ string

Retrieve the chat template string used for tokenizing chat messages. This template is used internally by the apply_chat_template method and can also be used externally to retrieve the model’s chat template for better generation tracking.

Kind: instance method of PreTrainedTokenizer
Returns: string - The chat template string.

Param	Type	Default	Description
options	`Object`		An optional object containing the following properties:
[options.chat_template]	`string`	`null`	A Jinja template or the name of a template to use for this conversion. It is usually not necessary to pass anything to this argument, as the model's template will be used by default.
[options.tools]	`Array`		A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema, giving the name, description and argument types for the tool. See our chat templating guide for more information.

preTrainedTokenizer.apply_chat_template(conversation, options) ⇒ string | Tensor | Array | Array | BatchEncoding

Converts a list of message objects with "role" and "content" keys to a list of token ids. This method is intended for use with chat models, and will read the tokenizer’s chat_template attribute to determine the format and control tokens to use when converting.

See the chat templating guide for more information.

Example: Applying a chat template to a conversation.

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/mistral-tokenizer-v1");

const chat = [
  { "role": "user", "content": "Hello, how are you?" },
  { "role": "assistant", "content": "I'm doing great. How can I help you today?" },
  { "role": "user", "content": "I'd like to show off how chat templating works!" },
];

const text = tokenizer.apply_chat_template(chat, { tokenize: false });
// "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false });
// [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, 28737, 28742, 28719, 2548, 1598, 28723, 1602, 541, 315, 1316, 368, 3154, 28804, 2, 28705, 733, 16289, 28793, 315, 28742, 28715, 737, 298, 1347, 805, 910, 10706, 5752, 1077, 3791, 28808, 733, 28748, 16289, 28793]

Kind: instance method of PreTrainedTokenizer
Returns: string | Tensor | Array | Array | BatchEncoding - The tokenized output.

Param	Type	Default	Description
conversation	`Array`		A list of message objects with `"role"` and `"content"` keys, representing the chat history so far.
options	`Object`		An optional object containing the following properties:
[options.chat_template]	`string`	`null`	A Jinja template to use for this conversion. If this is not passed, the model's chat template will be used instead.
[options.tools]	`Array`		A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema, giving the name, description and argument types for the tool. See our chat templating guide for more information.
[options.documents]	`Array.<Record>`		A list of objects representing documents that will be accessible to the model if it is performing RAG (retrieval-augmented generation). If the template does not support RAG, this argument will have no effect. We recommend that each document be an object containing "title" and "text" keys. See the RAG section of the chat templating guide for examples of passing documents with chat templates.
[options.add_generation_prompt]	`boolean`	`false`	Whether to end the prompt with the token(s) that indicate the start of an assistant message. This is useful when you want to generate a response from the model. Note that this argument will be passed to the chat template, and so it must be supported in the template for this argument to have any effect.
[options.tokenize]	`boolean`	`true`	Whether to tokenize the output. If false, the output will be a string.
[options.padding]	`boolean`	`false`	Whether to pad sequences to the maximum length. Has no effect if tokenize is false.
[options.truncation]	`boolean`	`false`	Whether to truncate sequences to the maximum length. Has no effect if tokenize is false.
[options.max_length]	`number`		Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is false. If not specified, the tokenizer's `max_length` attribute will be used as a default.
[options.return_tensor]	`boolean`	`true`	Whether to return the output as a Tensor or an Array. Has no effect if tokenize is false.
[options.return_dict]	`boolean`	`true`	Whether to return a dictionary with named outputs. Has no effect if tokenize is false.
[options.tokenizer_kwargs]	`Object`	`{}`	Additional options to pass to the tokenizer.
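
To show what a chat template does before tokenization, the sketch below hand-writes the Mistral-style format that appears in the example above. It is not how the library renders templates (real templates are Jinja strings stored with the tokenizer). The `renderMistralStyle` function and its `bos` and `eos` defaults are assumptions for illustration.

```javascript
// Hand-written stand-in for a Mistral-style chat template (hypothetical helper).
function renderMistralStyle(messages, { bos = "<s>", eos = "</s>" } = {}) {
  let out = bos;
  for (const { role, content } of messages) {
    if (role === "user") {
      // User turns are wrapped in instruction markers.
      out += `[INST] ${content} [/INST]`;
    } else if (role === "assistant") {
      // Assistant turns end with the end-of-sequence token and a space.
      out += `${content}${eos} `;
    } else {
      throw new Error(`Unsupported role: ${role}`);
    }
  }
  return out;
}

const messages = [
  { role: "user", content: "Hello, how are you?" },
  { role: "assistant", content: "I'm doing great. How can I help you today?" },
  { role: "user", content: "I'd like to show off how chat templating works!" },
];
console.log(renderMistralStyle(messages));
// → "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"
```

With `tokenize: false`, `apply_chat_template` returns a string of this kind; with `tokenize: true`, the rendered string is then passed through the tokenizer.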

PreTrainedTokenizer.from_pretrained(pretrained_model_name_or_path, options) ⇒ Promise.<PreTrainedTokenizer>

Loads a pre-trained tokenizer from the given pretrained_model_name_or_path.

Kind: static method of PreTrainedTokenizer
Returns: Promise.<PreTrainedTokenizer> - A new instance of the PreTrainedTokenizer class.
Throws: Error - If the tokenizer.json or tokenizer_config.json files are not found in `pretrained_model_name_or_path`.

Param	Type	Description
pretrained_model_name_or_path	`string`	The path to the pre-trained tokenizer.
options	`PretrainedTokenizerOptions`	Additional options for loading the tokenizer.

tokenizers.loadTokenizer(pretrained_model_name_or_path, options) ⇒ Promise.<Array>

Loads a tokenizer from the specified path.

Kind: static method of tokenizers
Returns: Promise.<Array> - A promise that resolves with information about the loaded tokenizer.

Param	Type	Description
pretrained_model_name_or_path	`string`	The path to the tokenizer directory.
options	`PretrainedTokenizerOptions`	Additional options for loading the tokenizer.

tokenizers.prepareTensorForDecode(tensor) ⇒ Array

Helper function to convert a tensor to a list before decoding.

Kind: static method of tokenizers
Returns: Array - The tensor as a list.

Param	Type	Description
tensor	`Tensor`	The tensor to convert.

tokenizers._build_translation_inputs(self, raw_inputs, tokenizer_options, generate_kwargs) ⇒ Object

Helper function to build translation inputs for an NllbTokenizer or M2M100Tokenizer.

Kind: static method of tokenizers
Returns: Object - Object to be passed to the model.

Param	Type	Description
self	`PreTrainedTokenizer`	The tokenizer instance.
raw_inputs	`string` \| `Array`	The text to tokenize.
tokenizer_options	`Object`	Options to be sent to the tokenizer.
generate_kwargs	`Object`	Generation options.

tokenizers~PretrainedTokenizerOptions : PretrainedOptions

Kind: inner typedef of tokenizers

tokenizers~Message : Object

Kind: inner typedef of tokenizers
Properties

Name	Type	Description
role	`string`	The role of the message (e.g., "user" or "assistant" or "system").
content	`string`	The content of the message.
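
A value matching this typedef is a plain object with two string fields. The check below is a small sketch for validating a conversation before passing it on. The `isMessage` function is a hypothetical helper, not something the library exports.

```javascript
// Hypothetical validator for the Message shape: an object with string `role` and `content`.
function isMessage(value) {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof value.role === "string" &&
    typeof value.content === "string"
  );
}

const conversation = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Hello!" },
];
console.log(conversation.every(isMessage)); // → true
console.log(isMessage({ role: "user" }));   // → false
```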

tokenizers~BatchEncoding : Array | Array | Tensor

Holds the output of the tokenizer’s call function.

Kind: inner typedef of tokenizers
Properties

Name	Type	Description
input_ids	`BatchEncodingItem`	List of token ids to be fed to a model.
attention_mask	`BatchEncodingItem`	List of indices specifying which tokens should be attended to by the model.
[token_type_ids]	`BatchEncodingItem`	List of token type ids to be fed to a model.
